Deep Latent-Variable Models of Natural Language

1 Deep Latent-Variable Models of Natural Language Yoon Kim, Sam Wiseman, Alexander Rush Tutorial

4 Goal of Latent-Variable Modeling Probabilistic models provide a declarative language for specifying prior knowledge and structural relationships in the context of unknown variables. Makes it easy to specify: known interactions in the data, uncertainty about unknown factors, constraints on model properties. 4/153

6 Latent-Variable Modeling in NLP Long and rich history of latent-variable models of natural language. Major successes include, among many others: statistical alignment for translation, document clustering and topic modeling, unsupervised part-of-speech tagging and parsing. 5/153

7 Goals of Deep Learning Toolbox of methods for learning rich, non-linear data representations through numerical optimization. Makes it easy to fit: highly-flexible predictive models, transferable feature representations, structurally-aligned network architectures. 6/153

9 Deep Learning in NLP Current dominant paradigm for NLP. Major successes include, among many others: text classification, neural machine translation, NLU tasks (QA, NLI, etc.). 7/153

10 Tutorial: Deep Latent-Variable Models for NLP How should a contemporary ML/NLP researcher reason about latent variables? What unique challenges come from modeling text with latent variables? What techniques have been explored and shown to be effective in recent papers? We explore these through the lens of variational inference. 8/153

12 Tutorial Take-Aways Goals Background 1 A collection of deep latent-variable models for NLP 2 An understanding of a variational objective 3 A toolkit of algorithms for optimization 4 A formal guide to advanced techniques 5 A survey of example applications 6 Code samples and techniques for practical use 9/153

13 Tutorial Non-Goals Not covered (for time, not relevance): many classical latent-variable approaches; undirected graphical models such as MRFs; non-likelihood-based models such as GANs; sampling-based inference such as MCMC; details of deep learning architectures. 10/153

15 What are deep networks? Deep networks are parameterized non-linear functions; they transform an input z into features h using parameters π. Important examples: the multilayer perceptron, h = MLP(z; π) = V σ(W z + b) + a, with π = {V, W, a, b}; and the recurrent neural network, which maps a sequence of inputs z_{1:T} into a sequence of features h_{1:T}, h_t = RNN(h_{t-1}, z_t; π) = σ(U z_t + V h_{t-1} + b), with π = {V, U, b}. 12/153
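
A minimal PyTorch sketch of these two building blocks (the sizes and parameter values below are illustrative, not from the tutorial):

import torch

def mlp(z, V, W, a, b):
    # h = V sigma(W z + b) + a
    return V @ torch.sigmoid(W @ z + b) + a

def rnn_step(h_prev, z_t, U, V, b):
    # h_t = sigma(U z_t + V h_{t-1} + b)
    return torch.sigmoid(U @ z_t + V @ h_prev + b)

d_in, d_hid = 5, 4
W, V = torch.randn(d_hid, d_in), torch.randn(d_hid, d_hid)
a, b = torch.zeros(d_hid), torch.zeros(d_hid)
h = mlp(torch.randn(d_in), V, W, a, b)

U = torch.randn(d_hid, d_in)
h_t = rnn_step(torch.zeros(d_hid), torch.randn(d_in), U, V, b)
print(h.shape, h_t.shape)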

17 What are latent variable models? Latent variable models give us a joint distribution p(x, z; θ), where x is our observed data, z is a collection of latent variables, and θ are the deterministic parameters of the model, such as the neural network parameters. Data consists of N i.i.d. samples, p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ). 13/153

20 Probabilistic Graphical Models A directed PGM shows the conditional independence structure. By the chain rule, a latent variable model over observations can be represented as p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ). Specific models may factor further. [Plate diagram: θ → z^{(n)} → x^{(n)}, plate over N] 14/153

21 Posterior For models p(x, z; θ), we'll be interested in the posterior over latent variables z: p(z | x; θ) = p(x, z; θ) / p(x; θ). Why? z will often represent interesting information about our data (e.g., the cluster x^{(n)} lives in, or how similar x^{(n)} and x^{(n+1)} are). Learning the parameters θ of the model often requires calculating posteriors as a subroutine. Intuition: if I know a likely z^{(n)} for x^{(n)}, I can learn by maximizing p(x^{(n)} | z^{(n)}; θ). 15/153

23 Problem Statement: Two Views Deep & LV are naturally complementary: rich function approximators with modular parts; declarative methods for specifying model constraints. Deep & LV are frustratingly incompatible: deep networks make posterior inference intractable; latent variable objectives complicate backpropagation. 16/153

26 A Language Model Our goal is to model a sentence x_1 ... x_T. Context: RNN language models are remarkable at this task, x_{1:T} ∼ RNNLM(x_{1:T}; θ), defined as p(x_{1:T}) = ∏_{t=1}^T p(x_t | x_{<t}) = ∏_{t=1}^T softmax(W h_t)_{x_t}, where h_t = RNN(h_{t-1}, x_{t-1}; θ). [Plate diagram: θ → x^{(n)}_1 ... x^{(n)}_T, plate over N] 18/153

28 A Collection of Model Archetypes Focus: semi-supervised or unsupervised learning, i.e. don't just learn the probabilities, but the process. Range of choices in selecting z: 1 Discrete LVs z (clustering) 2 Continuous LVs z (dimensionality reduction) 3 Structured LVs z (structured learning) 19/153

31 Model 1: Discrete Clustering Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." → Cluster 23. Discrete latent variable models induce a clustering over sentences x^{(n)}. Example uses: document/sentence clustering [Willett 1988; Aggarwal and Zhai 2012]; mixture-of-experts text generation models [Jacobs et al. 1991; Garmash and Monz 2016; Lee et al. 2016]. 21/153

33 Model 1: Discrete - Mixture of Categoricals Generative process: 1 Draw cluster z ∈ {1, ..., K} from a categorical with parameter µ. 2 Draw T words x_t from a categorical with word distribution π_z. Parameters: θ = {µ ∈ Δ^{K-1}, K × V stochastic matrix π}. Gives rise to the Naive Bayes distribution: p(x, z; θ) = p(z; µ) p(x | z; π) = µ_z ∏_{t=1}^T Cat(x_t; π_z) = µ_z ∏_{t=1}^T π_{z, x_t}. 22/153
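
A minimal sketch of this generative process and its joint probability in PyTorch (K, V, T and the parameter values are made up for illustration):

import torch

torch.manual_seed(0)
K, V, T = 3, 10, 5                                  # clusters, vocab size, sentence length
mu = torch.softmax(torch.randn(K), dim=0)           # cluster prior p(z)
pi = torch.softmax(torch.randn(K, V), dim=1)        # K x V stochastic matrix, p(x_t | z)

# Generative process: draw a cluster, then draw T words i.i.d. given that cluster.
z = torch.distributions.Categorical(probs=mu).sample()
x = torch.distributions.Categorical(probs=pi[z]).sample((T,))

# Joint probability p(x, z; theta) = mu_z * prod_t pi_{z, x_t}
log_joint = mu[z].log() + pi[z, x].log().sum()
print(z.item(), x.tolist(), log_joint.item())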

34 Model 1: Graphical Model View [Plate diagram: µ → z^{(n)} → x^{(n)}_1 ... x^{(n)}_T ← π, plate over N] ∏_{n=1}^N p(x^{(n)}, z^{(n)}; µ, π) = ∏_{n=1}^N p(z^{(n)}; µ) p(x^{(n)} | z^{(n)}; π) = ∏_{n=1}^N µ_{z^{(n)}} ∏_{t=1}^T π_{z^{(n)}, x^{(n)}_t}. 23/153

35 Deep Model 1: Discrete - Mixture of RNNs Generative process: 1 Draw cluster z ∈ {1, ..., K} from a categorical. 2 Draw words x_{1:T} from an RNNLM with parameters π_z. p(x, z; θ) = µ_z RNNLM(x_{1:T}; π_z). [Plate diagram: µ → z^{(n)}, π → x^{(n)}_1 ... x^{(n)}_T, plate over N] 24/153

36 Differences Between the Two Dependence structure: mixture of categoricals: x_t independent of the other x_j given z; mixture of RNNs: x_t fully dependent. Interesting question: how will this affect the learned latent space? Number of parameters: mixture of categoricals: K V; mixture of RNNs: K d² + V d for an RNN with d hidden dims. 25/153

38 Posterior For both discrete models, we can apply Bayes' rule: p(z | x; θ) = p(z) p(x | z) / p(x) = p(z) p(x | z) / ∑_{k=1}^K p(z = k) p(x | z = k). For the mixture of categoricals, the posterior uses word counts under each π_k. For the mixture of RNNs, the posterior requires running the RNN over x for each k. 26/153
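
Continuing the Naive Bayes sketch above, this posterior can be computed by enumerating z (using the illustrative quantities mu, pi, x defined there):

import torch

log_joint_all = mu.log() + pi[:, x].log().sum(dim=1)    # log p(x, z=k) for every k, shape (K,)
log_evidence = torch.logsumexp(log_joint_all, dim=0)    # log p(x)
posterior = (log_joint_all - log_evidence).exp()        # p(z | x), sums to 1
print(posterior.tolist())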

42 Model 2: Continuous / Dimensionality Reduction Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." Find a lower-dimensional, well-behaved continuous representation of a sentence. Latent variables in R^d make distance/similarity easy. Examples: recent work in text generation assumes a latent vector per sentence [Bowman et al. 2016; Yang et al. 2017; Hu et al. 2017]; certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015]) can be interpreted in this way. 28/153

44 Model 2: Continuous Mixture Generative process: 1 Draw continuous latent variable z from a Normal with parameter µ. 2 For each t, draw word x_t from a categorical with parameter softmax(W z). Parameters: θ = {µ ∈ R^d, π}, with π = {W ∈ R^{V×d}}. Intuition: µ is a global distribution; z captures the local word distribution of the sentence. 29/153

45 Graphical Model View [Plate diagram: µ → z^{(n)} → x^{(n)}_1 ... x^{(n)}_T ← π, plate over N] Gives rise to the joint distribution: ∏_{n=1}^N p(x^{(n)}, z^{(n)}; θ) = ∏_{n=1}^N p(z^{(n)}; µ) p(x^{(n)} | z^{(n)}; π). 30/153

46 Deep Model 2: Continuous Mixture of RNNs Generative process: 1 Draw latent variable z ∼ N(µ, I). 2 Draw each token x_t from a conditional RNNLM; the RNN is also conditioned on the latent z: p(x, z; π, µ, I) = p(z; µ, I) p(x | z; π) = N(z; µ, I) CRNNLM(x_{1:T}; π, z), where CRNNLM(x_{1:T}; π, z) = ∏_{t=1}^T softmax(W h_t)_{x_t} and h_t = RNN(h_{t-1}, [x_{t-1}; z]; π). 31/153

49 Graphical Model View [Plate diagram: µ → z^{(n)}, π → x^{(n)}_1 ... x^{(n)}_T, plate over N] 32/153

50 Posterior For continuous models, Bayes' rule is harder to compute: p(z | x; θ) = p(z; µ) p(x | z; π) / ∫ p(z; µ) p(x | z; π) dz. The shallow and deep Model 2 variants mirror the Model 1 variants exactly, but with continuous z. The integral is intractable (in general) for both shallow and deep variants. 33/153

53 Model 3: Structure Learning Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." Structured latent variable models are used to infer unannotated structure: unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]; unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]. Or when structure is useful for interpreting our data: segmentation of documents into topical passages [Hearst 1997]; alignment [Vogel et al. 1996]. 35/153

54 Model 3: Structured - Hidden Markov Model Generative process: 1 For each t, draw z_t ∈ {1, ..., K} from a categorical with parameter µ_{z_{t-1}}. 2 Draw observed token x_t from a categorical with parameter π_{z_t}. Parameters: θ = {K × K stochastic matrix µ, K × V stochastic matrix π}. Gives rise to the joint distribution: p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π) = ∏_{t=1}^T µ_{z_{t-1}, z_t} π_{z_t, x_t}. 36/153
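
A minimal sketch of sampling from this HMM and scoring the joint in PyTorch (sizes and parameters are illustrative; a uniform initial state distribution is assumed since the slide leaves it implicit):

import torch

torch.manual_seed(0)
K, V, T = 4, 12, 6
mu = torch.softmax(torch.randn(K, K), dim=1)    # K x K transition matrix, rows sum to 1
pi = torch.softmax(torch.randn(K, V), dim=1)    # K x V emission matrix
start = torch.full((K,), 1.0 / K)               # assumed uniform start distribution

zs, xs, log_joint = [], [], torch.tensor(0.0)
for t in range(T):
    trans = start if t == 0 else mu[zs[-1]]
    z_t = torch.distributions.Categorical(probs=trans).sample()
    x_t = torch.distributions.Categorical(probs=pi[z_t]).sample()
    log_joint = log_joint + trans[z_t].log() + pi[z_t, x_t].log()
    zs.append(z_t)
    xs.append(x_t)
print([z.item() for z in zs], [x.item() for x in xs], log_joint.item())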

56 Graphical Model View [Diagram: µ → z_1 → z_2 → z_3 → z_4; z_t → x_t ← π; plate over N] p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π) = ∏_{t=1}^T µ_{z_{t-1}, z_t} π_{z_t, x_t}. 37/153

57 Further Extension: Factorial HMM [Diagram: L latent chains z_{l,1} ... z_{l,4}, all feeding into x_1 ... x_4; plate over N] p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t-1}) ∏_{t=1}^T p(x_t | z_{1:L,t}). 38/153

58 Deep Model 3: Deep HMM Parameterize the transition and emission distributions with neural networks (c.f. Tran et al. [2016]). Model the transition distribution as p(z_t | z_{t-1}) = softmax(MLP(z_{t-1}; µ)) and the emission distribution as p(x_t | z_t) = softmax(MLP(z_t; π)). Note: K × K transition parameters for the standard HMM vs. O(K d + d²) for the deep version. 39/153

61 Posterior For structured models, Bayes' rule may be tractable: p(z | x; θ) = p(z; µ) p(x | z; π) / ∑_{z'} p(z'; µ) p(x | z'; π). Unlike the previous models, z contains interdependent parts. For both shallow and deep Model 3 variants, it's possible to calculate p(x; θ) exactly with a dynamic program. For some structured models, like the Factorial HMM, the dynamic program may still be intractable. 41/153

65 Learning with Maximum Likelihood: Find model parameters θ that maximize the likelihood of the data, θ* = argmax_θ ∑_{n=1}^N log p(x^{(n)}; θ). 44/153

66 Learning Deep Models L(θ) = ∑_{n=1}^N log p(x^{(n)}; θ). [Diagram: θ → x, plate over N] The dominant framework is gradient-based optimization: θ^{(i)} = θ^{(i-1)} + η ∇_θ L(θ), with ∇_θ L(θ) calculated via backpropagation. Tactics: mini-batch based training, adaptive learning rates [Duchi et al. 2011; Kingma and Ba 2015]. 45/153

67 Learning Deep Latent-Variable Models: Marginalization The likelihood requires summing out the latent variables, p(x; θ) = ∑_{z ∈ Z} p(x, z; θ) (= ∫ p(x, z; θ) dz if z is continuous). In general, it is hard to optimize the log-likelihood of the training set, L(θ) = ∑_{n=1}^N log ∑_{z ∈ Z} p(x^{(n)}, z; θ). [Diagram: θ → z^{(n)} → x^{(n)}, plate over N] 46/153

70 High-level: decompose the objective into a lower bound and a gap, L(θ) = LB(θ, λ) + GAP(θ, λ) for some λ. [Diagram: L(θ) split into GAP(θ, λ) and LB(θ, λ)] This provides a framework for deriving a rich set of optimization algorithms. 48/153

71 Marginal Likelihood: Decomposition For any¹ distribution q(z | x; λ) over z, L(θ) = E_q[log p(x, z; θ) / q(z | x; λ)] + KL[q(z | x; λ) || p(z | x; θ)], where the first term is the ELBO (evidence lower bound) and the second is the posterior gap. Since the KL is always non-negative, L(θ) ≥ ELBO(θ, λ). ¹Technical condition: supp(q(z)) ⊆ supp(p(z | x; θ)). 49/153

72 Evidence Lower Bound: Proof log p(x; θ) = E_q[log p(x)] (expectation over z) = E_q[log p(x, z) / p(z | x)] (mult/div by p(z | x), combine numerator) = E_q[log (p(x, z) / q(z | x)) · (q(z | x) / p(z | x))] (mult/div by q(z | x)) = E_q[log p(x, z) / q(z | x)] + E_q[log q(z | x) / p(z | x)] (split log) = E_q[log p(x, z; θ) / q(z | x; λ)] + KL[q(z | x; λ) || p(z | x; θ)]. 50/153
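
As a sanity check, this decomposition can be verified numerically on a tiny discrete model (the probability tables below are made up for illustration):

import torch

torch.manual_seed(0)
K = 3
p_z = torch.softmax(torch.randn(K), dim=0)     # prior p(z)
p_x_given_z = torch.rand(K)                    # p(x | z = k) for one fixed observation x
q = torch.softmax(torch.randn(K), dim=0)       # an arbitrary q(z | x)

log_px = (p_z * p_x_given_z).sum().log()                       # log p(x)
elbo = (q * ((p_z * p_x_given_z) / q).log()).sum()             # E_q[log p(x, z) / q(z | x)]
post = p_z * p_x_given_z / (p_z * p_x_given_z).sum()           # exact posterior p(z | x)
kl = (q * (q / post).log()).sum()                              # KL[q || p(z | x)]
print(log_px.item(), (elbo + kl).item())                       # equal, and elbo <= log_px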

77 Evidence Lower Bound over Observations ELBO(θ, λ; x) = E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)]. The ELBO is a function of the generative model parameters θ and the variational parameters λ. Over the dataset, ∑_{n=1}^N log p(x^{(n)}; θ) ≥ ∑_{n=1}^N ELBO(θ, λ; x^{(n)}) = ∑_{n=1}^N E_{q(z | x^{(n)}; λ)}[log p(x^{(n)}, z; θ) / q(z | x^{(n)}; λ)] = ELBO(θ, λ; x^{(1:N)}) = ELBO(θ, λ). 51/153

78 Setup: Selecting a Variational Family Just as with p and θ, we can select any form of q and λ that satisfies the ELBO conditions. Different choices of q will lead to different algorithms. We will explore several forms of q: the exact posterior, a point estimate (MAP), amortized families, and mean field (later). 52/153

79 Example Variational Family: Full Posterior Form [Diagram: local variational parameters λ^{(n)} → z^{(n)}, matched to the model's z^{(n)} via KL(q(z | x) || p(z | x))] λ = [λ^{(1)}, ..., λ^{(N)}] is a concatenation of local variational parameters λ^{(n)}, e.g. q(z^{(n)} | x^{(n)}; λ) = q(z^{(n)} | x^{(n)}; λ^{(n)}) = N(λ^{(n)}, 1). 53/153

80 Example Family: Amortized Parameterization [Kingma and Welling 2014] [Diagram: a global inference network λ maps each x^{(n)} to a distribution over z^{(n)}, matched to the model via KL(q(z | x) || p(z | x))] λ parameterizes a global network (encoder/inference network) that is run over x^{(n)} to produce the local variational distribution, e.g. q(z^{(n)} | x^{(n)}; λ) = N(µ^{(n)}, 1), with µ^{(n)} = enc(x^{(n)}; λ). 54/153

82 Maximizing the Evidence Lower Bound Central quantity of interest: almost all methods are maximizing the aggregate ELBO objective, argmax_{θ,λ} ELBO(θ, λ) = argmax_{θ,λ} ∑_{n=1}^N ELBO(θ, λ; x^{(n)}) = argmax_{θ,λ} ∑_{n=1}^N E_q[log p(x^{(n)}, z^{(n)}; θ) / q(z^{(n)} | x^{(n)}; λ)]. 56/153

84 Maximizing ELBO: Model Parameters argmax_θ E_q[log p(x, z; θ) / q(z | x; λ)] = argmax_θ E_q[log p(x, z; θ)]. Intuition: a maximum likelihood problem where the latent variables are drawn from q(z | x; λ). 57/153

85 Model Estimation: Gradient Ascent on Model Parameters Easy: the gradient with respect to θ, ∇_θ ELBO(θ, λ; x) = ∇_θ E_q[log p(x, z; θ)] = E_q[∇_θ log p(x, z; θ)]. Since q does not depend on θ, the gradient moves inside the expectation. Estimate with samples from q; the term ∇_θ log p(x, z; θ) is easy to evaluate. (In practice a single sample is often sufficient.) In special cases, we can evaluate the expectation exactly. 58/153

87 Maximizing ELBO: Variational Distribution argmax_λ ELBO(θ, λ) = argmax_λ (log p(x; θ) − KL[q(z | x; λ) || p(z | x; θ)]) = argmin_λ KL[q(z | x; λ) || p(z | x; θ)]. Intuition: q should approximate the posterior p(z | x). However, this may be difficult if q or p is a deep model. 59/153

88 Variational Model: Gradient Ascent on λ? Hard: the gradient with respect to λ, ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) / q(z | x; λ)] ≠ E_q[∇_λ log p(x, z; θ) / q(z | x; λ)]. We cannot naively move the gradient inside the expectation, since q depends on λ. This section covers the strategies used in practice: 1 Exact gradient 2 Sampling: score function, reparameterization 3 Conjugacy: closed-form, coordinate ascent. 60/153

91 Strategy 1: Exact Gradient ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = ∇_λ ∑_{z ∈ Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)). Naive enumeration is linear in |Z|; depending on the structure of q and p, potentially faster with dynamic programming. Applicable mainly to Models 1 and 3 (discrete and structured), or Model 2 with a point estimate. 62/153

93 Example: Model 1 - Naive Bayes Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ). Then ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = ∇_λ ∑_{z ∈ Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)) = ∇_λ ∑_{z ∈ Z} ν_z log (p(x, z; θ) / ν_z). 63/153
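
A minimal sketch of this exact ELBO and its gradient via autodiff, reusing the illustrative Naive Bayes quantities (mu, pi, x, K, V, T) from the earlier sketch; the bag-of-words encoder is a hypothetical stand-in for enc(x; λ):

import torch

enc = torch.nn.Linear(V, K)                                 # toy inference network
x_bow = torch.zeros(V).index_add_(0, x, torch.ones(T))      # bag-of-words features of x
nu = torch.softmax(enc(x_bow), dim=0)                       # q(z | x; lambda) = Cat(nu)

log_joint_all = mu.log() + pi[:, x].log().sum(dim=1)        # log p(x, z=k) for all k
elbo = (nu * (log_joint_all - nu.log())).sum()              # exact ELBO by enumerating z
elbo.backward()                                             # gradient w.r.t. the encoder parameters
print(elbo.item(), enc.weight.grad.norm().item())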

96 Strategy 2: Sampling ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) − log q(z | x; λ)] = ∇_λ E_q[log p(x, z; θ)] − ∇_λ E_q[log q(z | x; λ)]. How can we approximate this gradient with sampling? The naive algorithm fails to provide a non-zero gradient: with z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ), ∇_λ (1/J) ∑_{j=1}^J log p(x, z^{(j)}; θ) = 0. Instead, manipulate the expression so we can move ∇_λ inside E_q before sampling. 65/153

101 Strategy 2a: Sampling - Score Function Gradient Estimator First term. Use the basic identity ∇_λ q = q ∇_λ log q (policy-gradient style training [Williams 1992]): ∇_λ E_q[log p(x, z; θ)] = ∑_z ∇_λ q(z | x; λ) log p(x, z; θ) = ∑_z q(z | x; λ) ∇_λ log q(z | x; λ) log p(x, z; θ) = E_q[log p(x, z; θ) ∇_λ log q(z | x; λ)]. 69/153

106 Strategy 2a: Sampling - Score Function Gradient Estimator Second term. Need the additional identity ∑_z ∇_λ q = ∇_λ ∑_z q = ∇_λ 1 = 0: ∇_λ E_q[log q(z | x; λ)] = ∑_z ∇_λ (q(z | x; λ) log q(z | x; λ)) = ∑_z log q(z | x; λ) q(z | x; λ) ∇_λ log q(z | x; λ) + ∑_z ∇_λ q(z | x; λ) = E_q[log q(z | x; λ) ∇_λ log q(z | x; λ)]. 74/153

107 Strategy 2a: Sampling - Score Function Gradient Estimator Putting these together, ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) / q(z | x; λ)] = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] = E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]. 75/153

108 Strategy 2a: Sampling - Score Function Gradient Estimator Estimate with samples z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ): E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)] ≈ (1/J) ∑_{j=1}^J R_{θ,λ}(z^{(j)}) ∇_λ log q(z^{(j)} | x; λ). Intuition: if a sample z^{(j)} has high reward R_{θ,λ}(z^{(j)}), increase the probability of z^{(j)} by moving along the gradient ∇_λ log q(z^{(j)} | x; λ). 76/153
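
A minimal sketch of this estimator for the same illustrative Naive Bayes model and toy encoder as above (no control variate; a fresh forward pass recomputes nu):

import torch

J = 16
nu = torch.softmax(enc(x_bow), dim=0)                       # q(z | x; lambda)
samples = torch.distributions.Categorical(probs=nu).sample((J,))
log_q = nu[samples].log()                                   # log q(z^{(j)} | x; lambda)
reward = (log_joint_all[samples] - log_q).detach()          # R(z^{(j)}), treated as a constant
surrogate = (reward * log_q).mean()                         # its gradient is the score-function estimate
enc.zero_grad()
surrogate.backward()
print(enc.weight.grad.norm().item())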

109 Strategy 2a: Sampling - Score Function Gradient Estimator Essentially reinforcement learning with reward R_{θ,λ}(z). The score function gradient is generally applicable regardless of what form q takes (we only need to evaluate ∇_λ log q). This generality comes at a cost, since the reward is "black-box": an unbiased estimator, but high variance. In practice we need a variance-reducing control variate B. (More on this later.) 77/153

110 Example: Model 1 - Naive Bayes Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ), and sample z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ). Then ∇_λ ELBO(θ, λ; x) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] ≈ (1/J) ∑_{j=1}^J log (p(x, z^{(j)}; θ) / ν_{z^{(j)}}) ∇_λ log ν_{z^{(j)}}. Computational complexity: O(J) vs. O(|Z|). 78/153

112 Strategy 2b: Sampling - Reparameterization Suppose we can sample from q by applying a deterministic, differentiable transformation g to noise from a base density: ɛ ∼ U, z = g(ɛ, λ). Gradient calculation (first term): ∇_λ E_{z ∼ q(z | x; λ)}[log p(x, z; θ)] = ∇_λ E_{ɛ ∼ U}[log p(x, g(ɛ, λ); θ)] = E_{ɛ ∼ U}[∇_λ log p(x, g(ɛ, λ); θ)] ≈ (1/J) ∑_{j=1}^J ∇_λ log p(x, g(ɛ^{(j)}, λ); θ), where ɛ^{(1)}, ..., ɛ^{(J)} ∼ U. 79/153

114 Exact Gradient Sampling Conjugacy Strategy 2b: Sampling Reparameterization Unbiased, like the score function gradient estimator, but empirically lower variance. In practice, single sample is often sufficient. Cannot be used out-of-the-box for discrete z. 80/153

115 Strategy 2: Continuous Latent Variable RNN Choose the variational family to be an amortized diagonal Gaussian, q(z | x; λ) = N(µ, σ²) with µ, σ² = enc(x; λ). Then we can sample from q(z | x; λ) by drawing ɛ ∼ N(0, I) and setting z = µ + σ ⊙ ɛ. [Diagram: the encoder λ maps x to (µ, σ); noise ɛ combines with them to give z, which conditions the decoder over x_1 ... x_T] 81/153
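
A minimal sketch of a reparameterized gradient through a diagonal Gaussian q (the encoder, decoder, and reconstruction term below are illustrative stand-ins, not the tutorial's architecture):

import torch

d_x, d_z = 8, 2
x_feat = torch.randn(d_x)                       # pretend features of a sentence
enc_net = torch.nn.Linear(d_x, 2 * d_z)         # amortized encoder -> (mu, log sigma^2)
dec_net = torch.nn.Linear(d_z, d_x)             # stand-in decoder for p(x | z)

mu, logvar = enc_net(x_feat).chunk(2, dim=-1)
sigma = (0.5 * logvar).exp()
eps = torch.randn(d_z)
z = mu + sigma * eps                            # z = g(eps, lambda): differentiable in lambda

# toy reconstruction term standing in for log p(x | z; theta); KL term omitted for brevity
recon = -((dec_net(z) - x_feat) ** 2).sum()
recon.backward()                                # gradients flow into enc_net through z
print(enc_net.weight.grad.norm().item())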

116 Strategy 2b: Sampling - Reparameterization (Recall R_{θ,λ}(z) = log p(x, z; θ) / q(z | x; λ).) Score function: ∇_λ ELBO(θ, λ; x) = E_{z ∼ q}[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]. Reparameterization: ∇_λ ELBO(θ, λ; x) = E_{ɛ ∼ N(0,I)}[∇_λ R_{θ,λ}(g(ɛ, λ; x))], where g(ɛ, λ; x) = µ + σ ⊙ ɛ. Informally, reparameterization gradients differentiate through R_{θ,λ}(·) and thus have more knowledge about the structure of the objective function. 82/153

118 Strategy 3: Conjugacy For certain choices of p and q, we can compute parts of the objective exactly in closed form. Recall that argmax_λ ELBO(θ, λ; x) = argmin_λ KL[q(z | x; λ) || p(z | x; θ)]. 84/153

120 Strategy 3a: Conjugacy - Tractable Posterior Suppose we can tractably calculate p(z | x; θ). Then KL[q(z | x; λ) || p(z | x; θ)] is minimized when q(z | x; λ) = p(z | x; θ): the E-step of the Expectation Maximization algorithm [Dempster et al. 1977]. 85/153

121 Example: Model 1 - Naive Bayes p(z | x; θ) = p(x, z; θ) / ∑_{z'=1}^K p(x, z'; θ), so λ is given by the parameters of the categorical distribution, i.e. λ = [p(z = 1 | x; θ), ..., p(z = K | x; θ)]. 86/153

122 Example: Model 3 - HMM [Diagram: µ → z_1 → ... → z_4; z_t → x_t ← π; plate over N] p(x, z; θ) = p(z_0) ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π). 87/153

123 Example: Model 3 - HMM Run the forward/backward dynamic program to calculate the posterior marginals p(z_t, z_{t+1} | x; θ); the variational parameters λ ∈ R^{T × K²} store these edge marginals. They are enough to calculate q(z; λ) = p(z | x; θ) (i.e. the exact posterior) for any sequence z. 88/153

125 Connection: Gradient Ascent on the Log Marginal Likelihood Why not perform gradient ascent directly on the log marginal likelihood log p(x; θ) = log ∑_z p(x, z; θ)? This is the same as optimizing the ELBO with exact posterior inference (i.e. EM). The gradients of the model parameters are given by (where q(z | x; λ) = p(z | x; θ)): ∇_θ log p(x; θ) = E_{q(z | x; λ)}[∇_θ log p(x, z; θ)]. 89/153

127 Connection: Gradient Ascent on the Log Marginal Likelihood Practically, this means we don't have to manually perform posterior inference in the E-step: we can just calculate log p(x; θ) and call backpropagation. Example: in a deep HMM, just implement the forward algorithm to calculate log p(x; θ) and backpropagate using autodiff; no need to implement the backward algorithm (or vice versa). (See Eisner [2016]: "Inside-Outside and Forward-Backward Algorithms Are Just Backprop".) 90/153
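
A minimal sketch of that idea for the HMM: the forward algorithm in log space, with autodiff providing gradients of log p(x; θ) (sizes are illustrative; the observed sequence xs and the uniform start distribution are assumptions):

import math
import torch

K, V, T = 4, 12, 6
trans_logits = torch.randn(K, K, requires_grad=True)
emit_logits = torch.randn(K, V, requires_grad=True)
xs = torch.randint(V, (T,))                         # assumed observed token sequence

log_mu = torch.log_softmax(trans_logits, dim=1)     # log transition matrix
log_pi = torch.log_softmax(emit_logits, dim=1)      # log emission matrix
log_start = torch.full((K,), -math.log(K))          # assumed uniform initial distribution

# Forward recursion: alpha[k] = log p(x_{1:t}, z_t = k)
alpha = log_start + log_pi[:, xs[0]]
for t in range(1, T):
    alpha = torch.logsumexp(alpha.unsqueeze(1) + log_mu, dim=0) + log_pi[:, xs[t]]
log_marginal = torch.logsumexp(alpha, dim=0)        # log p(x; theta)
log_marginal.backward()                             # no hand-written backward algorithm needed
print(log_marginal.item(), trans_logits.grad.abs().sum().item())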

128 Strategy 3b: Conditional Conjugacy Let p(z | x; θ) be intractable, but suppose p(x, z; θ) is conditionally conjugate, meaning each complete conditional p(z_t | x, z_{-t}; θ) is in the exponential family. Restrict the family of distributions q so that it factorizes over the z_t, i.e. q(z; λ) = ∏_{t=1}^T q(z_t; λ_t) (the mean field family). Further choose each q(z_t; λ_t) to be in the same family as p(z_t | x, z_{-t}; θ). 91/153

129 Strategy 3b: Conditional Conjugacy [Diagram: local factors q(z^{(n)}_t; λ^{(n)}_t) matched to the model posterior via KL(q(z) || p(z | x))] q(z; λ) = ∏_{t=1}^T q(z_t; λ_t). 92/153

130 Mean Field Family Optimize the ELBO via coordinate ascent, i.e. iterate over λ_1, ..., λ_T, solving argmin_{λ_t} KL[∏_{t=1}^T q(z_t; λ_t) || p(z | x; θ)]. The coordinate ascent updates take the form q(z_t; λ_t) ∝ exp(E_{q(z_{-t}; λ_{-t})}[log p(x, z; θ)]), where E_{q(z_{-t}; λ_{-t})}[log p(x, z; θ)] = ∑_{z_{-t}} (∏_{j ≠ t} q(z_j; λ_j)) log p(x, z; θ). Since p(z_t | x, z_{-t}) was assumed to be in the exponential family, the updates above can be derived in closed form. 93/153

131 Example: Model 3 - Factorial HMM [Diagram: L latent chains z_{l,1} ... z_{l,4}, all feeding into x_1 ... x_4; plate over N] p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t-1}; θ) ∏_{t=1}^T p(x_t | z_{1:L,t}; θ). 94/153

132 Example: Model 3 - Factorial HMM [Diagram: the factorial HMM with the factor currently being updated highlighted] Update each factor in turn, e.g. q(z_{1,1}; λ_{1,1}) ∝ exp(E_{q(z_{-(1,1)}; λ_{-(1,1)})}[log p(x, z; θ)]), then q(z_{2,1}; λ_{2,1}) ∝ exp(E_{q(z_{-(2,1)}; λ_{-(2,1)})}[log p(x, z; θ)]), and so on. 95/153

134 Exact Inference: Example: Model 3 - Factorial HMM Naive: K states, L levels ⇒ an HMM with K^L states ⇒ O(T K^{2L}). Smarter: O(T L K^{L+1}). Mean field: Gaussian emissions, O(T L K²) [Ghahramani and Jordan 1996]; categorical emissions need more variational approximations, but are ultimately O(L K V T) [Nepal and Yates 2013]. 97/153

137 Gumbel-Softmax Flows IWAE 1 Gumbel-Softmax: Extend reparameterization to discrete variables. 2 Flows: Optimize a tighter bound by making the variational family q more flexible. 3 Importance Weighting: Optimize a tighter bound through importance sampling. 99/153

139 Challenges of Discrete Variables Review: we can always use the score function estimator, ∇_λ ELBO(x, θ, λ) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] = E_q[(log (p(x, z; θ) / q(z | x; λ)) − B) ∇_λ log q(z | x; λ)], since E_q[B ∇_λ log q(z | x; λ)] = 0 (because E_q[∇ log q] = ∑ q ∇ log q = ∑ ∇q = 0). The control variate B must not depend on z, but can depend on x; it can be estimated with another neural net [Mnih and Gregor 2014] by minimizing (log (p(x, z; θ) / q(z | x; λ)) − B(x; ψ))². 101/153

141 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] The Gumbel-Max trick [Papandreou and Yuille 2011]: let p(z_k = 1; α) = α_k / ∑_{j=1}^K α_j, where z = [0, 0, ..., 1, ..., 0] is a one-hot vector. We can sample from p(z; α) by 1 drawing independent Gumbel noise ɛ = ɛ_1, ..., ɛ_K with ɛ_k = −log(−log u_k), u_k ∼ U(0, 1), and 2 adding ɛ_k to log α_k and finding the argmax, i.e. i = argmax_k [log α_k + ɛ_k], z_i = 1. 102/153

144 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] Reparameterization: z = argmax_{s ∈ Δ^{K-1}} s^⊤(log α + ɛ) = g(ɛ, α), so z = g(ɛ, α) is a deterministic function applied to stochastic noise. Let's try applying this (recalling R_{θ,λ}(z) = log p(x, z; θ) / q(z | x; λ)), with q(z_k = 1 | x; λ) = α_k / ∑_{j=1}^K α_j and α = enc(x; λ): ∇_λ E_{q(z | x; λ)}[R_{θ,λ}(z)] = ∇_λ E_{ɛ ∼ Gumbel}[R_{θ,λ}(g(ɛ, α))] = E_{ɛ ∼ Gumbel}[∇_λ R_{θ,λ}(g(ɛ, α))]. 103/153

147 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] But this won't work, because z = g(ɛ, α) = argmax_{s ∈ Δ^{K-1}} s^⊤(log α + ɛ) has zero gradients almost everywhere, so ∇_λ R_{θ,λ}(z) = 0. Gumbel-Softmax trick: replace the argmax with a softmax, z = softmax((log α + ɛ) / τ), i.e. z_k = exp((log α_k + ɛ_k)/τ) / ∑_{j=1}^K exp((log α_j + ɛ_j)/τ), where τ is a temperature term. Then ∇_λ E_{q(z | x; λ)}[R_{θ,λ}(z)] ≈ E_{ɛ ∼ Gumbel}[∇_λ R_{θ,λ}(softmax((log α + ɛ)/τ))]. 104/153
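
A minimal sketch contrasting the hard Gumbel-Max sample with its Gumbel-Softmax relaxation (the logits and temperature are illustrative):

import torch

torch.manual_seed(0)
K, tau = 5, 0.5
log_alpha = torch.randn(K, requires_grad=True)

u = torch.rand(K)
eps = -torch.log(-torch.log(u))                              # Gumbel(0, 1) noise

hard = torch.zeros(K)
hard[(log_alpha + eps).argmax()] = 1.0                       # Gumbel-Max: one-hot, no gradient path

soft = torch.softmax((log_alpha + eps) / tau, dim=0)         # Gumbel-Softmax relaxation
soft.sum().backward()                                        # gradients reach log_alpha
print(hard.tolist(), [round(s, 3) for s in soft.tolist()])
# torch.nn.functional.gumbel_softmax provides this (hard=True gives a straight-through variant).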

150 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] Approaches a discrete distribution as τ → 0 (anneal τ during training). Reparameterizable by construction; differentiable with non-zero gradients. [Figure from Maddison et al. [2017]] 105/153

151 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] See Maddison et al. [2017] on whether we can use the original categorical densities p(z), q(z), or need to use the relaxed densities p_GS(z), q_GS(z). Requires that p(x | z; θ) makes sense for non-discrete z (e.g. attention). A lower-variance but biased gradient estimator; the variance grows as τ → 0. 106/153

153 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Recall log p(x; θ) = ELBO(θ, λ; x) + KL[q(z | x; λ) || p(z | x; θ)]. The bound is tight when the variational posterior equals the true posterior: q(z | x; λ) = p(z | x; θ) ⇒ log p(x; θ) = ELBO(θ, λ; x). We want to make q(z | x; λ) as flexible as possible: can we do better than just a Gaussian? 108/153

155 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Idea: transform a sample from a simple initial variational distribution, z_0 ∼ q(z | x; λ) = N(µ, σ²) with µ, σ² = enc(x; λ), into a more complex one, z_K = f_K ∘ ... ∘ f_2 ∘ f_1(z_0; λ), where the f_i(z_{i-1}; λ) are invertible transformations (whose parameters are absorbed into λ). 109/153

157 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] The sample from the final variational posterior is given by z_K. Its density is given by the change of variables formula: log q_K(z_K | x; λ) = log q(z_0 | x; λ) − ∑_{k=1}^K log |det ∂f_k/∂z_{k-1}|, where the first term is the log density of the Gaussian and the sum is over log determinants of Jacobians. The determinant calculation is O(N³) in general, but can be made faster depending on the parameterization of f_k. 110/153

158 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] We can still use reparameterization to obtain gradients. Letting F(z) = f_K ∘ ... ∘ f_1(z), ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q_K(z_K | x; λ)}[log p(x, z; θ) / q_K(z_K | x; λ)] = ∇_λ E_{q(z_0 | x; λ)}[log (p(x, F(z_0); θ) / q(z_0 | x; λ)) + log |det ∂F/∂z_0|] = E_{ɛ ∼ N(0,I)}[∇_λ (log (p(x, F(z_0); θ) / q(z_0 | x; λ)) + log |det ∂F/∂z_0|)]. 111/153

159 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Examples of f_k(z_{k-1}; λ): Normalizing flows [Rezende and Mohamed 2015], f_k(z_{k-1}) = z_{k-1} + u_k h(w_k^⊤ z_{k-1} + b_k); Inverse autoregressive flows [Kingma et al. 2016], f_k(z_{k-1}) = z_{k-1} ⊙ σ_k + µ_k, with σ_{k,d} = sigmoid(NN(z_{k-1,<d})) and µ_{k,d} = NN(z_{k-1,<d}). (In the IAF case the Jacobian is triangular, so the determinant is just the product of the diagonal entries.) 112/153
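
A minimal sketch of one normalizing-flow step of the first kind above and its log-density correction (dimensions and parameter values are illustrative):

import torch

d = 2
z0 = torch.randn(d)
u, w, b = torch.randn(d), torch.randn(d), torch.randn(())

def flow_step(z):
    # f(z) = z + u * tanh(w^T z + b)
    return z + u * torch.tanh(w @ z + b)

z1 = flow_step(z0)

# log |det df/dz| = log |1 + u^T psi(z)|, with psi(z) = (1 - tanh^2(w^T z + b)) w
psi = (1 - torch.tanh(w @ z0 + b) ** 2) * w
log_det = torch.log(torch.abs(1 + u @ psi))

base = torch.distributions.Normal(torch.zeros(d), torch.ones(d))
log_q0 = base.log_prob(z0).sum()
log_q1 = log_q0 - log_det                    # change-of-variables density of z1
print(z1.tolist(), log_q1.item())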

160 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] [Figure from Rezende and Mohamed [2015]] 113/153

162 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] Flows are one way of tightening the ELBO by making the variational family more flexible. Not the only way: we can obtain a tighter lower bound on log p(x; θ) by using multiple importance samples. Consider I_K = (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ), where z^{(1:K)} ∼ q(z | x; λ). Note that I_K is an unbiased estimator of p(x; θ): E_{q(z^{(1:K)} | x; λ)}[I_K] = p(x; θ). 115/153

164 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] Any unbiased estimator of p(x; θ) can be used to obtain a lower bound, using Jensen's inequality: p(x; θ) = E_{q(z^{(1:K)} | x; λ)}[I_K] ⇒ log p(x; θ) ≥ E_{q(z^{(1:K)} | x; λ)}[log I_K] = E_{q(z^{(1:K)} | x; λ)}[log (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ)]. Moreover, one can also show [Burda et al. 2015]: log p(x; θ) ≥ E[log I_K] ≥ E[log I_{K-1}], and lim_{K→∞} E[log I_K] = log p(x; θ) under mild conditions. 116/153

165 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] The IWAE objective is E_{q(z^{(1:K)} | x; λ)}[log (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ)]. Note that with K = 1 we recover the ELBO, and the ratios p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ) can be interpreted as importance weights. If q(z | x; λ) is reparameterizable, we can use the reparameterization trick to optimize E[log I_K] directly; otherwise, we need score function gradient estimators [Mnih and Rezende 2016]. 117/153
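
A minimal sketch of the K-sample bound for a toy Gaussian model and Gaussian q (all of the distributions below are illustrative):

import math
import torch

torch.manual_seed(0)
K, d = 8, 2
x = torch.randn(d)                                                   # a fake observation
prior = torch.distributions.Normal(torch.zeros(d), torch.ones(d))    # p(z)
q = torch.distributions.Normal(torch.zeros(d) + 0.5, torch.ones(d))  # a crude q(z | x)

z = q.rsample((K,))                                                  # K reparameterized samples
log_lik = torch.distributions.Normal(z, torch.ones(d)).log_prob(x).sum(-1)   # log p(x | z^{(k)})
log_w = prior.log_prob(z).sum(-1) + log_lik - q.log_prob(z).sum(-1)  # log importance weights

elbo = log_w.mean()                                                  # average single-sample ELBO estimate
iwae = torch.logsumexp(log_w, dim=0) - math.log(K)                   # log (1/K) sum_k w_k
print(elbo.item(), iwae.item())                                      # iwae >= elbo, by Jensen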

168 Sentence VAE Example [Bowman et al. 2016] Generative model (Model 2): draw z ∼ N(0, I); draw x_t | z ∼ CRNNLM(θ, z). Variational model (amortized): deep diagonal Gaussians, q(z | x; λ) = N(µ, σ²) with h_T = RNN(x; ψ), µ = W_1 h_T, σ² = exp(W_2 h_T), λ = {W_1, W_2, ψ}. 120/153

169 Sentence VAE Example [Bowman et al. 2016] [Figure from Bowman et al. [2016]; diagram of the amortized encoder λ producing z from x and the decoder generating x_1 ... x_T from z] 121/153

170 Issue 1: Posterior Collapse ELBO(θ, λ) = E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = E_{q(z | x; λ)}[log p(x | z; θ)] − KL[q(z | x; λ) || p(z)], i.e. a reconstruction likelihood term minus a regularizer term. [Table comparing RNN LM and RNN VAE on LL/ELBO, reconstruction, and KL; on the Yahoo corpus, from Yang et al. [2017]] 122/153

171 Issue 1: Posterior Collapse x and z become independent, and p(x, z; θ) reduces to a non-latent-variable language model. Chen et al. [2017]: if it is possible to model p*(x) without making use of z, then the ELBO optimum is at p*(x) = p(x | z; θ) = p(x; θ), q(z | x; λ) = p(z), and KL[q(z | x; λ) || p(z)] = 0. 123/153

172 Mitigating Posterior Collapse Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or word dropout [Bowman et al. 2016]. [Table comparing RNN LM, RNN VAE, Word Drop, and CNN VAE on LL/ELBO, reconstruction, and KL; on the Yahoo corpus, from Yang et al. [2017]] 124/153

173 Mitigating Posterior Collapse Gradually anneal the multiplier on the KL term, i.e. optimize E_{q(z | x; λ)}[log p(x | z; θ)] − β KL[q(z | x; λ) || p(z)], where β goes from 0 to 1 as training progresses. [Figure from Bowman et al. [2016]] 125/153
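
A minimal sketch of a linear annealing schedule applied to a per-example objective (the warm-up length and the two loss terms are placeholders):

def beta_schedule(step, warmup_steps=10000):
    # beta rises linearly from 0 to 1 over the first warmup_steps updates
    return min(1.0, step / warmup_steps)

def annealed_elbo(reconstruction_ll, kl, step):
    # reconstruction_ll = E_q[log p(x | z)], kl = KL[q(z | x) || p(z)]
    return reconstruction_ll - beta_schedule(step) * kl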

174 Mitigating Posterior Collapse Other approaches: use auxiliary losses (e.g. train z as part of a topic model) [Dieng et al. 2017; Wang et al. 2018]; use a von Mises-Fisher distribution with a fixed concentration parameter [Guu et al. 2017; Xu and Durrett 2018]; combine stochastic/amortized variational inference [Kim et al. 2018]; add skip connections [Dieng et al. 2018]. In practice, it is often necessary to combine various methods. 126/153

175 Issue 2: Evaluation The ELBO always lower bounds log p(x; θ), so we can calculate an upper bound on perplexity efficiently. When reporting the ELBO, we should also separately report KL[q(z | x; λ) || p(z)] to give an indication of how much the latent variable is being used. 127/153

176 Issue 2: Evaluation We can also evaluate log p(x; θ) with importance sampling: p(x; θ) = E_{q(z | x; λ)}[p(x | z; θ) p(z) / q(z | x; λ)] ≈ (1/K) ∑_{k=1}^K p(x | z^{(k)}; θ) p(z^{(k)}) / q(z^{(k)} | x; λ), so log p(x; θ) ≈ log (1/K) ∑_{k=1}^K p(x | z^{(k)}; θ) p(z^{(k)}) / q(z^{(k)} | x; λ). 128/153

177 Qualitative Evaluation Evaluate samples from the prior/variational posterior; interpolation in latent space. [Figure from Bowman et al. [2016]] 129/153

179 Encoder/Decoder [Sutskever et al. 2014; Cho et al. 2014] Given: source information s = s_1, ..., s_M. Generative process: draw x_{1:T} | s ∼ CRNNLM(θ, enc(s)). 131/153

180 Latent, Per-token Experts [Yang et al. 2018] Generative process: For t = 1, ..., T: Draw z_t | x_{<t}, s ∼ softmax(U h_t). Draw x_t | z_t, x_{<t}, s ∼ softmax(W tanh(Q_{z_t} h_t); θ). Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics [Plate diagram: latents z_1^(n), ..., z_T^(n) over tokens x_1^(n), ..., x_T^(n), for documents n = 1, ..., N.] If U ∈ R^{K×d}, we have K experts; this increases the flexibility of the per-token distribution. 132/153

181 Case-Study: Latent Per-token Experts [Yang et al. 2018] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Learning: the z_t are independent given x_{<t}, so we can marginalize at each time-step (Method 3: Conjugacy): argmax_θ log p(x | s; θ) = argmax_θ Σ_{t=1}^T log Σ_{k=1}^K p(z_t = k | s, x_{<t}; θ) p(x_t | z_t = k, x_{<t}, s; θ). Test-time: argmax_{x_{1:T}} Π_{t=1}^T Σ_{k=1}^K p(z_t = k | s, x_{<t}; θ) p(x_t | z_t = k, x_{<t}, s; θ). 133/153
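A sketch of the exact per-time-step marginalization for this mixture-of-softmaxes model: the K expert distributions are mixed under the prior softmax(U h_t), so no sampling is needed. Tensor shapes and names (K, d, vocab) are illustrative assumptions.

```python
# Per-token marginal p(x_t | x_<t, s) for K softmax experts, summed out exactly.
import torch
import torch.nn.functional as F

def mixture_of_softmaxes(h_t, U, Q, W):
    # h_t: (batch, d), U: (K, d), Q: (K, d, d), W: (vocab, d)
    pi = F.softmax(h_t @ U.t(), dim=-1)                      # (batch, K) expert weights
    proj = torch.tanh(torch.einsum('kde,be->bkd', Q, h_t))   # (batch, K, d)
    expert_probs = F.softmax(proj @ W.t(), dim=-1)           # (batch, K, vocab)
    return torch.einsum('bk,bkv->bv', pi, expert_probs)      # marginal over experts
```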

182 Case-Study: Latent, Per-token Experts [Yang et al. 2018] PTB language modeling results (s is constant): Model | PPL, with rows Merity et al. [2018] and Softmax-mixture [Yang et al. 2018]. Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Dialogue generation results (s is context): Model | BLEU Prec | BLEU Rec, with rows No mixture and Softmax-mixture [Yang et al. 2018]. 134/153

183 Attention [Bahdanau et al. 2015] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Decoding with an attention mechanism: x_t | x_{<t}, s ∼ softmax(W [h_t, Σ_{m=1}^M α_{t,m} enc(s)_m]). 135/153

184 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Copy Attention [Gu et al. 2016; Gulcehre et al. 2016] Copy attention models copying words directly from s. Generative process: For t = 1, ..., T: Set α_t to be the attention weights. Draw z_t | x_{<t}, s ∼ Bern(MLP([h_t, enc(s)])). If z_t = 0: Draw x_t | z_t, x_{<t}, s ∼ softmax(W h_t). Else: Draw x_t ∈ {s_1, ..., s_M} | z_t, x_{<t}, s ∼ Cat(α_t). 136/153

185 Copy Attention Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Learning: Can maximize the log per-token marginal [Gu et al. 2016], as with per-token experts: max_θ log p(x_1, ..., x_T | s; θ) = max_θ Σ_{t=1}^T log Σ_{z∈{0,1}} p(z_t = z | s, x_{<t}; θ) p(x_t | z, x_{<t}, s; θ). Test-time: argmax_{x_{1:T}} Π_{t=1}^T Σ_{z∈{0,1}} p(z_t = z | s, x_{<t}; θ) p(x_t | z, x_{<t}, s; θ). 137/153
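A hedged sketch of the per-token marginal above for copy attention: the binary gate z_t is summed out analytically, mixing the generation softmax with the attention mass on source positions whose token equals x_t. The names (gen_logits, attn, p_copy, src) are illustrative, not from the papers' code.

```python
# Per-token log marginal for copy attention, summing out the copy gate z_t.
import torch
import torch.nn.functional as F

def copy_marginal_logprob(gen_logits, attn, p_copy, src, x_t):
    # gen_logits: (batch, vocab); attn: (batch, M); p_copy: (batch,)
    # src: (batch, M) source token ids; x_t: (batch,) target token id
    p_gen = F.softmax(gen_logits, dim=-1).gather(1, x_t.unsqueeze(1)).squeeze(1)
    copy_mass = (attn * (src == x_t.unsqueeze(1)).float()).sum(dim=1)
    marginal = (1.0 - p_copy) * p_gen + p_copy * copy_mass
    return torch.log(marginal + 1e-12)
```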

186 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Generative process: For t = 1, ..., T: Set α_t to be the attention weights. Draw z_t | x_{<t}, s ∼ Cat(α_t). Draw x_t | z_t, x_{<t}, s ∼ softmax(W [h_t, enc(s)_{z_t}]; θ). 138/153

187 Attention as a Latent Variable [Deng et al. 2018] Marginal likelihood under the latent attention model: p(x_{1:T} | s; θ) = Π_{t=1}^T Σ_{m=1}^M α_{t,m} softmax(W [h_t, enc(s)_m]; θ)_{x_t}. Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Standard attention likelihood: p(x_{1:T} | s; θ) = Π_{t=1}^T softmax(W [h_t, Σ_{m=1}^M α_{t,m} enc(s)_m]; θ)_{x_t}. 139/153
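A sketch contrasting the two per-token likelihoods above: latent attention mixes M per-position softmaxes under α_t, while standard attention applies a single softmax to the attention-weighted context. The names (W, h_t, enc_s, alpha) are illustrative.

```python
# Latent attention vs. standard attention, per time step t.
import torch
import torch.nn.functional as F

def latent_attention_prob(h_t, enc_s, alpha, W):
    # h_t: (B, d); enc_s: (B, M, d); alpha: (B, M); W: (vocab, 2d)
    B, M, d = enc_s.size()
    inp = torch.cat([h_t.unsqueeze(1).expand(B, M, d), enc_s], dim=-1)  # (B, M, 2d)
    per_pos = F.softmax(inp @ W.t(), dim=-1)                            # (B, M, vocab)
    return torch.einsum('bm,bmv->bv', alpha, per_pos)  # sum_m alpha_{t,m} softmax(...)

def standard_attention_prob(h_t, enc_s, alpha, W):
    ctx = torch.einsum('bm,bmd->bd', alpha, enc_s)       # attention-weighted context
    return F.softmax(torch.cat([h_t, ctx], dim=-1) @ W.t(), dim=-1)
```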

188 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Learning Strategy #1: Maximize the log marginal via enumeration, as above. Learning Strategy #2: Maximize the ELBO with AVI: max_{λ,θ} E_{q(z_t; λ)}[log p(x_t | x_{<t}, z_t, s)] − KL[q(z_t; λ) || p(z_t | x_{<t}, s)]. q(z_t | x; λ) approximates p(z_t | x_{1:T}, s; θ); implemented with a BLSTM. q isn't reparameterizable, so gradients are obtained using REINFORCE + a baseline. 140/153
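A minimal sketch of the score-function (REINFORCE) surrogate with a baseline for the non-reparameterizable categorical q(z_t; λ): backpropagating through it gives (reward − baseline) ∇_λ log q(z_t) for the inference network while still passing the usual gradient to θ; the exact KL term would be added separately. The baseline is assumed to be, for example, a running mean of the reward.

```python
# REINFORCE surrogate loss with a baseline control variate.
import torch

def reinforce_surrogate(log_q_z, reward, baseline):
    # log_q_z: (batch,) log q(z_t^(sample); lambda) for the sampled z_t
    # reward:  (batch,) log p(x_t | x_<t, z_t, s; theta) for the same samples
    # baseline: scalar control variate, e.g. a running mean of reward
    advantage = (reward - baseline).detach()
    # advantage * log_q_z drives the lambda gradient; reward drives the theta gradient
    return (advantage * log_q_z + reward).mean()
```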

189 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Test-time: Calculate p(x_t | x_{<t}, s; θ) by summing out z_t. MT results on IWSLT-2014: Model | PPL | BLEU, with rows Standard Attn, Latent Attn (marginal), and Latent Attn (ELBO). 141/153

190 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Encoder/Decoder with Structured Latent Variables At least two EMNLP 2018 papers augment encoder/decoder text generation models with structured latent variables: 1 Lee et al. [2018] generate x_{1:T} by iteratively refining sequences of words z_{1:T}. 2 Wiseman et al. [2018] generate x_{1:T} conditioned on a latent template or plan z_{1:S}. 142/153

191 5 Encoder/Decoder with Latent Variables 6 Latent Summaries and Topics 143/153

192 Summary as a Latent Variable [Miao and Blunsom 2016] Generative process for a document x = x_1, ..., x_T: Draw a latent summary z_1, ..., z_M ∼ RNNLM(θ). Draw x_1, ..., x_T | z_{1:M} ∼ CRNNLM(θ, z). Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Posterior: p(z_{1:M} | x_{1:T}; θ) = p(summary | document; θ). 144/153

194 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Summary as a Latent Variable [Miao and Blunsom 2016] Learning: Maximize the ELBO with an amortized family: max_{λ,θ} E_{q(z_{1:M}; λ)}[log p(x_{1:T} | z_{1:M}; θ)] − KL[q(z_{1:M}; λ) || p(z_{1:M}; θ)]. q(z_{1:M}; λ) approximates p(z_{1:M} | x_{1:T}; θ); also implemented with encoder/decoder RNNs. q(z_{1:M}; λ) is not reparameterizable, so gradients use REINFORCE + baselines. 145/153

195 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Summary as a Latent Variable [Miao and Blunsom 2016] Semi-supervised Training: Can also use documents without corresponding summaries in training. Train q(z_{1:M}; λ) ≈ p(z_{1:M} | x_{1:T}; θ) with labeled examples. Infer a summary z for an unlabeled document with q. Use the inferred z to improve the model p(x_{1:T} | z_{1:M}; θ). Allows for outperforming strictly supervised models! 146/153
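A hedged, pseudocode-style sketch of this semi-supervised recipe. The callables (log_p_x_given_z, log_q_z_given_x, sample_q, log_p_z) stand in for the generative and inference networks and are assumptions, not Miao and Blunsom's actual interface.

```python
# Semi-supervised objective: supervise p and q on labeled pairs, and use
# q-inferred summaries (via a single-sample ELBO) on unlabeled documents.
def semi_supervised_loss(x, z, log_p_x_given_z, log_q_z_given_x, sample_q, log_p_z):
    if z is not None:
        # labeled (document, summary) pair: train both p and q directly
        return -(log_p_x_given_z(x, z) + log_q_z_given_x(z, x))
    # unlabeled document: infer a summary with q, then improve the generative model
    z_hat, log_q = sample_q(x)
    return -(log_p_x_given_z(x, z_hat) + log_p_z(z_hat) - log_q)
```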

196 Topic Models [Blei et al. 2003] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Generative process: for each document x^(n) = x_1^(n), ..., x_T^(n): Draw a topic distribution z_top^(n) ∼ Dir(α). For t = 1, ..., T: Draw a topic z_t^(n) ∼ Cat(z_top^(n)). Draw x_t^(n) ∼ Cat(β_{z_t^(n)}). 147/153
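A minimal NumPy sketch of this generative story (ancestral sampling only, no inference). The inputs alpha, beta, T, and seed are illustrative, with the rows of beta assumed to be topic-word distributions.

```python
# Sample one document from the LDA generative process.
import numpy as np

def generate_document(alpha, beta, T, seed=0):
    # alpha: (K,) Dirichlet prior; beta: (K, V) array, rows are topic-word distributions
    rng = np.random.default_rng(seed)
    z_top = rng.dirichlet(alpha)                      # document topic distribution
    doc = []
    for _ in range(T):
        z_t = rng.choice(len(alpha), p=z_top)         # draw topic z_t ~ Cat(z_top)
        x_t = rng.choice(beta.shape[1], p=beta[z_t])  # draw word x_t ~ Cat(beta_{z_t})
        doc.append(x_t)
    return doc
```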
