Deep Latent-Variable Models of Natural Language
1 Deep Latent-Variable Models of Natural Language
Yoon Kim, Sam Wiseman, Alexander Rush
Tutorial
2 Goals Background 1 Goals Background /153
4 Goal of Latent-Variable Modeling
Probabilistic models provide a declarative language for specifying prior knowledge and structural relationships in the context of unknown variables. They make it easy to specify:
- Known interactions in the data
- Uncertainty about unknown factors
- Constraints on model properties
6 Latent-Variable Modeling in NLP
Long and rich history of latent-variable models of natural language. Major successes include, among many others:
- Statistical alignment for translation
- Document clustering and topic modeling
- Unsupervised part-of-speech tagging and parsing
7 Goals of Deep Learning
Toolbox of methods for learning rich, non-linear data representations through numerical optimization. Makes it easy to fit:
- Highly flexible predictive models
- Transferable feature representations
- Structurally aligned network architectures
9 Deep Learning in NLP
Current dominant paradigm for NLP. Major successes include, among many others:
- Text classification
- Neural machine translation
- NLU tasks (QA, NLI, etc.)
10 Tutorial: Deep Latent-Variable Models for NLP
- How should a contemporary ML/NLP researcher reason about latent variables?
- What unique challenges come from modeling text with latent variables?
- What techniques have been explored and shown to be effective in recent papers?
We explore these through the lens of variational inference.
12 Tutorial Take-Aways
1. A collection of deep latent-variable models for NLP
2. An understanding of a variational objective
3. A toolkit of algorithms for optimization
4. A formal guide to advanced techniques
5. A survey of example applications
6. Code samples and techniques for practical use
13 Tutorial Non-Objectives
Not covered (for time, not relevance):
- Many classical latent-variable approaches.
- Undirected graphical models such as MRFs.
- Non-likelihood-based models such as GANs.
- Sampling-based inference such as MCMC.
- Details of deep learning architectures.
15 What are deep networks?
Deep networks are parameterized non-linear functions; they transform input $z$ into features $h$ using parameters $\pi$. Important examples:
- The multilayer perceptron,
$$h = \mathrm{MLP}(z; \pi) = V \sigma(W z + b) + a, \qquad \pi = \{V, W, a, b\}.$$
- The recurrent neural network, which maps a sequence of inputs $z_{1:T}$ into a sequence of features $h_{1:T}$,
$$h_t = \mathrm{RNN}(h_{t-1}, z_t; \pi) = \sigma(U z_t + V h_{t-1} + b), \qquad \pi = \{V, U, b\}.$$
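The two definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the slides do not fix the non-linearity $\sigma$, so `tanh` is assumed here, and the helper names (`matvec`, `add`, `sigma`, `mlp`, `rnn`) as well as the toy weights are hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

def sigma(v):
    """Element-wise non-linearity (assumption: tanh)."""
    return [math.tanh(a) for a in v]

def mlp(z, V, W, a, b):
    """h = V sigma(W z + b) + a"""
    return add(matvec(V, sigma(add(matvec(W, z), b))), a)

def rnn(zs, U, V, b, h0):
    """h_t = sigma(U z_t + V h_{t-1} + b), returned for every t."""
    h, hs = h0, []
    for z_t in zs:
        h = sigma(add(add(matvec(U, z_t), matvec(V, h)), b))
        hs.append(h)
    return hs

# Toy parameters (hypothetical values, chosen only to make the output easy to check).
h_mlp = mlp([1.0, 2.0],
            V=[[1.0, 0.0], [0.0, 1.0]],
            W=[[0.0, 0.0], [0.0, 0.0]],
            a=[0.5, -0.5],
            b=[0.0, 0.0])
h_rnn = rnn([[0.0], [1.0]], U=[[1.0]], V=[[0.0]], b=[0.0], h0=[0.0])
```

With `W` set to zero the MLP output reduces to the bias `a`, and with the recurrent matrix set to zero each RNN state depends only on the current input.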
17 What are latent variable models?
Latent variable models give us a joint distribution $p(x, z; \theta)$.
- $x$ is our observed data
- $z$ is a collection of latent variables
- $\theta$ are the deterministic parameters of the model, such as the neural network parameters
Data consists of $N$ i.i.d. samples,
$$p(x^{(1:N)}, z^{(1:N)}; \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid z^{(n)}; \theta)\, p(z^{(n)}; \theta).$$
20 Probabilistic Graphical Models
A directed PGM shows the conditional independence structure. By the chain rule, a latent variable model over observations can be represented as
[Figure: graphical model over $\theta$, $z^{(n)}$, $x^{(n)}$, with a plate of size $N$]
$$p(x^{(1:N)}, z^{(1:N)}; \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid z^{(n)}; \theta)\, p(z^{(n)}; \theta).$$
Specific models may factor further.
21 Posterior Inference
For models $p(x, z; \theta)$, we'll be interested in the posterior over latent variables $z$:
$$p(z \mid x; \theta) = \frac{p(x, z; \theta)}{p(x; \theta)}.$$
Why?
- $z$ will often represent interesting information about our data (e.g., the cluster $x^{(n)}$ lives in, how similar $x^{(n)}$ and $x^{(n+1)}$ are).
- Learning the parameters $\theta$ of the model often requires calculating posteriors as a subroutine.
- Intuition: if I know likely $z^{(n)}$ for $x^{(n)}$, I can learn by maximizing $p(x^{(n)} \mid z^{(n)}; \theta)$.
23 Problem Statement: Two Views
Deep Models & LV Models are naturally complementary:
- Rich function approximators with modular parts.
- Declarative methods for specifying model constraints.
Deep Models & LV Models are frustratingly incompatible:
- Deep networks make posterior inference intractable.
- Latent variable objectives complicate backpropagation.
25 1 Discrete Continuous Structured 2 Discrete Continuous Structured /153
26 A Language Model
Our goal is to model a sentence, $x_1 \ldots x_T$.
Context: RNN language models are remarkable at this task,
$$x_{1:T} \sim \mathrm{RNNLM}(x_{1:T}; \theta).$$
Defined as,
$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) = \prod_{t=1}^{T} \mathrm{softmax}(W h_t)_{x_t}, \qquad \text{where } h_t = \mathrm{RNN}(h_{t-1}, x_{t-1}; \theta).$$
[Figure: graphical model over $x^{(n)}_1, \ldots, x^{(n)}_T$ in a plate of size $N$, with parameters $\theta$]
28 A Collection of Model Archetypes
Focus: semi-supervised or unsupervised learning, i.e. don't just learn the probabilities, but the process. Range of choices in selecting $z$:
1. Discrete LVs $z$ (Clustering)
2. Continuous LVs $z$ (Dimensionality reduction)
3. Structured LVs $z$ (Structured learning)
31 Model 1: Discrete Clustering
Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." → Cluster 23
Discrete latent variable models induce a clustering over sentences $x^{(n)}$. Example uses:
- Document/sentence clustering [Willett 1988; Aggarwal and Zhai 2012].
- Mixture of expert text generation models [Jacobs et al. 1991; Garmash and Monz 2016; Lee et al. 2016].
33 Model 1: Discrete - Mixture of Categoricals
Generative process:
1. Draw cluster $z \in \{1, \ldots, K\}$ from a categorical with param $\mu$.
2. Draw $T$ words $x_t$ from a categorical with word distribution $\pi_z$.
Parameters: $\theta = \{\mu \in \Delta^{K-1},\ K \times V \text{ stochastic matrix } \pi\}$
Gives rise to the "Naive Bayes" distribution:
$$p(x, z; \theta) = p(z; \mu)\, p(x \mid z; \pi) = \mu_z \prod_{t=1}^{T} \mathrm{Cat}(x_t; \pi_z) = \mu_z \prod_{t=1}^{T} \pi_{z, x_t}$$
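34 Model 1: Graphical Model View
[Figure: graphical model with $\mu$, $\pi$, latent $z^{(n)}$, and observed $x^{(n)}_1, \ldots, x^{(n)}_T$ in a plate of size $N$]
$$\prod_{n=1}^{N} p(x^{(n)}, z^{(n)}; \mu, \pi) = \prod_{n=1}^{N} p(z^{(n)}; \mu)\, p(x^{(n)} \mid z^{(n)}; \pi) = \prod_{n=1}^{N} \mu_{z^{(n)}} \prod_{t=1}^{T} \pi_{z^{(n)}, x^{(n)}_t}$$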
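As a rough illustration of the mixture-of-categoricals model, the sketch below draws one sentence by ancestral sampling and evaluates $\log p(x, z; \theta) = \log \mu_z + \sum_t \log \pi_{z, x_t}$. The function names and the toy values of `mu` and `pi` (two clusters, three word types) are hypothetical and not part of the tutorial.

```python
import math
import random

def sample_categorical(probs, rng):
    """Draw an index from a categorical given as a list of probabilities."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1  # guard against floating-point round-off

def sample_sentence(mu, pi, T, rng):
    """Generative process: z ~ Cat(mu), then T words x_t ~ Cat(pi[z])."""
    z = sample_categorical(mu, rng)
    x = [sample_categorical(pi[z], rng) for _ in range(T)]
    return x, z

def log_joint(x, z, mu, pi):
    """log p(x, z) = log mu_z + sum_t log pi_{z, x_t}"""
    return math.log(mu[z]) + sum(math.log(pi[z][x_t]) for x_t in x)

# Toy parameters (assumed for illustration): K = 2 clusters, V = 3 word types.
mu = [0.5, 0.5]
pi = [[0.8, 0.1, 0.1],
      [0.1, 0.1, 0.8]]

rng = random.Random(0)
x, z = sample_sentence(mu, pi, T=5, rng=rng)
lp = log_joint([0, 0], 0, mu, pi)  # log(0.5 * 0.8 * 0.8)
```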
35 Deep Model 1: Discrete - Mixture of RNNs
Generative process:
1. Draw cluster $z \in \{1, \ldots, K\}$ from a categorical.
2. Draw words $x_{1:T}$ from RNNLM with parameters $\pi_z$.
$$p(x, z; \theta) = \mu_z\, \mathrm{RNNLM}(x_{1:T}; \pi_z)$$
[Figure: graphical model with $\mu$, $\pi$, latent $z^{(n)}$, and observed $x^{(n)}_1, \ldots, x^{(n)}_T$ in a plate of size $N$]
36 Difference Between Models
Dependence structure:
- Mixture of Categoricals: $x_t$ independent of other $x_j$ given $z$.
- Mixture of RNNs: $x_t$ fully dependent.
Interesting question: how will this affect the learned latent space?
Number of parameters:
- Mixture of Categoricals: $K \times V$.
- Mixture of RNNs: $K \times d^2 + V \times d$ for an RNN with $d$ hidden dims.
38 Posterior Inference
For both discrete models, can apply Bayes' rule:
$$p(z \mid x; \theta) = \frac{p(z)\, p(x \mid z)}{p(x)} = \frac{p(z)\, p(x \mid z)}{\sum_{k=1}^{K} p(z = k)\, p(x \mid z = k)}$$
- For mixture of categoricals, posterior uses word counts under each $\pi_k$.
- For mixture of RNNs, posterior requires running RNN over $x$ for each $k$.
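For the mixture of categoricals, Bayes' rule above amounts to scoring the sentence under each cluster and normalizing. A minimal sketch, assuming hypothetical toy parameters `mu` and `pi` and working in log space for stability:

```python
import math

def log_joint(x, z, mu, pi):
    """log p(x, z) = log mu_z + sum_t log pi_{z, x_t}"""
    return math.log(mu[z]) + sum(math.log(pi[z][x_t]) for x_t in x)

def logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def posterior(x, mu, pi):
    """Return p(z = k | x) for every k, together with log p(x)."""
    scores = [log_joint(x, k, mu, pi) for k in range(len(mu))]
    log_px = logsumexp(scores)  # log of the sum over k in the denominator
    return [math.exp(s - log_px) for s in scores], log_px

# Toy parameters (assumed for illustration): K = 2 clusters, V = 3 word types.
mu = [0.5, 0.5]
pi = [[0.8, 0.1, 0.1],
      [0.1, 0.1, 0.8]]

post, log_px = posterior([0, 0, 2], mu, pi)
```

For the mixture of RNNs the same normalization applies, except that each `log_joint` call would require a full pass of the cluster-specific RNN over the sentence.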
42 Model 2: Continuous / Dimensionality Reduction
Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines."
Find a lower-dimensional, well-behaved continuous representation of a sentence. Latent variables in $\mathbb{R}^d$ make distance/similarity easy. Examples:
- Recent work in text generation assumes a latent vector per sentence [Bowman et al. 2016; Yang et al. 2017; Hu et al. 2017].
- Certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015]) can be interpreted in this way.
44 Model 2: Continuous Mixture
Generative process:
1. Draw continuous latent variable $z$ from Normal with param $\mu$.
2. For each $t$, draw word $x_t$ from categorical with param $\mathrm{softmax}(W z)$.
Parameters: $\theta = \{\mu \in \mathbb{R}^d, \pi\}$, $\pi = \{W \in \mathbb{R}^{V \times d}\}$
Intuition: $\mu$ is a global distribution, $z$ captures local word distribution of the sentence.
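45 Graphical Model View
[Figure: graphical model with $\mu$, $\pi$, latent $z^{(n)}$, and observed $x^{(n)}_1, \ldots, x^{(n)}_T$ in a plate of size $N$]
Gives rise to the joint distribution:
$$\prod_{n=1}^{N} p(x^{(n)}, z^{(n)}; \theta) = \prod_{n=1}^{N} p(z^{(n)}; \mu)\, p(x^{(n)} \mid z^{(n)}; \pi)$$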
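The generative process of the shallow continuous model can be sketched as follows. The sketch assumes an identity covariance for the Normal prior (the deep variant later in the tutorial uses $\mathcal{N}(\mu, I)$; the shallow slide only says "Normal with param $\mu$"), and the names `sample_model2`, `mu`, and `W` with their toy values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_model2(mu, W, T, rng):
    """z ~ N(mu, I) (assumed unit covariance), then T words x_t ~ Cat(softmax(W z))."""
    z = [rng.gauss(m, 1.0) for m in mu]
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in W]
    probs = softmax(logits)
    x = rng.choices(range(len(probs)), weights=probs, k=T)
    return x, z

# Toy parameters (assumed for illustration): d = 2 latent dims, V = 3 word types.
mu = [0.0, 0.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]

rng = random.Random(0)
x, z = sample_model2(mu, W, T=4, rng=rng)
```

All words in a sentence share the same `probs`, so the single vector `z` determines the sentence-level word distribution.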
46 Deep Model 2: Continuous Mixture of RNNs
Generative process:
1. Draw latent variable $z \sim \mathcal{N}(\mu, I)$.
2. Draw each token $x_t$ from a conditional RNNLM.
RNN is also conditioned on latent $z$,
$$p(x, z; \pi, \mu, I) = p(z; \mu, I)\, p(x \mid z; \pi) = \mathcal{N}(z; \mu, I)\, \mathrm{CRNNLM}(x_{1:T}; \pi, z)$$
where
$$\mathrm{CRNNLM}(x_{1:T}; \pi, z) = \prod_{t=1}^{T} \mathrm{softmax}(W h_t)_{x_t}, \qquad h_t = \mathrm{RNN}(h_{t-1}, [x_{t-1}; z]; \pi)$$
49 Graphical Model View
[Figure: graphical model with $\mu$, $\pi$, latent $z^{(n)}$, and observed $x^{(n)}_1, \ldots, x^{(n)}_T$ in a plate of size $N$]
50 Posterior Inference
For continuous models, Bayes' rule is harder to compute,
$$p(z \mid x; \theta) = \frac{p(z; \mu)\, p(x \mid z; \pi)}{\int p(z'; \mu)\, p(x \mid z'; \pi)\, dz'}$$
- Shallow and deep Model 2 variants mirror Model 1 variants exactly, but with continuous $z$.
- Integral intractable (in general) for both shallow and deep variants.
53 Model 3: Structure Learning
Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines."
Structured latent variable models are used to infer unannotated structure:
- Unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]
- Unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]
Or when structure is useful for interpreting our data:
- Segmentation of documents into topical passages [Hearst 1997]
- Alignment [Vogel et al. 1996]
54 Model 3: Structured - Hidden Markov Model
Generative process:
1. For each $t$, draw $z_t \in \{1, \ldots, K\}$ from a categorical with param $\mu_{z_{t-1}}$.
2. Draw observed token $x_t$ from categorical with param $\pi_{z_t}$.
Parameters: $\theta = \{K \times K \text{ stochastic matrix } \mu,\ K \times V \text{ stochastic matrix } \pi\}$
Gives rise to the joint distribution:
$$p(x, z; \theta) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}; \mu_{z_{t-1}}) \prod_{t=1}^{T} p(x_t \mid z_t; \pi_{z_t}) = \prod_{t=1}^{T} \mu_{z_{t-1}, z_t} \prod_{t=1}^{T} \pi_{z_t, x_t}$$
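56 Graphical Model View
[Figure: HMM chain $z_1 \to z_2 \to z_3 \to z_4$ with emissions $x_1, \ldots, x_4$, parameters $\mu$ and $\pi$, in a plate of size $N$]
$$p(x, z; \theta) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}; \mu_{z_{t-1}}) \prod_{t=1}^{T} p(x_t \mid z_t; \pi_{z_t}) = \prod_{t=1}^{T} \mu_{z_{t-1}, z_t} \prod_{t=1}^{T} \pi_{z_t, x_t}$$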
57 Further Extension: Factorial HMM
[Figure: three parallel latent chains $z_{l,1}, \ldots, z_{l,4}$ for $l = 1, 2, 3$, jointly emitting $x_1, \ldots, x_4$, in a plate of size $N$]
$$p(x, z; \theta) = \prod_{l=1}^{L} \prod_{t=1}^{T} p(z_{l,t} \mid z_{l,t-1}) \prod_{t=1}^{T} p(x_t \mid z_{1:L,t})$$
58 Deep Model 3: Deep HMM
Parameterize transition and emission distributions with neural networks (cf. Tran et al. [2016]).
- Model transition distribution as $p(z_t \mid z_{t-1}) = \mathrm{softmax}(\mathrm{MLP}(z_{t-1}; \mu))$
- Model emission distribution as $p(x_t \mid z_t) = \mathrm{softmax}(\mathrm{MLP}(z_t; \pi))$
Note: $K \times K$ transition parameters for standard HMM vs. $O(K \times d + d^2)$ for deep version.
61 Posterior Inference
For structured models, Bayes' rule may be tractable,
$$p(z \mid x; \theta) = \frac{p(z; \mu)\, p(x \mid z; \pi)}{\sum_{z'} p(z'; \mu)\, p(x \mid z'; \pi)}$$
- Unlike previous models, $z$ contains interdependent "parts."
- For both shallow and deep Model 3 variants, it's possible to calculate $p(x; \theta)$ exactly, with a dynamic program.
- For some structured models, like Factorial HMM, the dynamic program may still be intractable.
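The dynamic program mentioned above is, for an HMM, the forward algorithm. The sketch below computes $p(x; \theta)$ in $O(T K^2)$ time and checks it against brute-force enumeration over all $K^T$ latent sequences. It assumes an explicit initial-state distribution `init` (the slides instead index the first transition by $z_0$), and the toy parameter values are hypothetical. Probabilities are multiplied directly, which is fine for short sequences but would need log-space arithmetic for long ones.

```python
import itertools

def forward_likelihood(x, init, mu, pi):
    """p(x) via the forward recursion: alpha_t(k) = sum_j alpha_{t-1}(j) mu[j][k] pi[k][x_t]."""
    K = len(init)
    alpha = [init[k] * pi[k][x[0]] for k in range(K)]
    for x_t in x[1:]:
        alpha = [sum(alpha[j] * mu[j][k] for j in range(K)) * pi[k][x_t]
                 for k in range(K)]
    return sum(alpha)

def brute_force_likelihood(x, init, mu, pi):
    """p(x) by summing the joint over every latent sequence (exponential in T)."""
    K, total = len(init), 0.0
    for z in itertools.product(range(K), repeat=len(x)):
        p = init[z[0]] * pi[z[0]][x[0]]
        for t in range(1, len(x)):
            p *= mu[z[t - 1]][z[t]] * pi[z[t]][x[t]]
        total += p
    return total

# Toy parameters (assumed for illustration): K = 2 states, V = 2 word types.
init = [0.6, 0.4]
mu = [[0.7, 0.3],
      [0.2, 0.8]]
pi = [[0.9, 0.1],
      [0.3, 0.7]]
x = [0, 1, 1, 0]

p_forward = forward_likelihood(x, init, mu, pi)
p_brute = brute_force_likelihood(x, init, mu, pi)
```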
63 1 Maximum Likelihood ELBO 2 3 Maximum Likelihood ELBO /153
65 Learning with Maximum Likelihood
Objective: find model parameters $\theta$ that maximize the likelihood of the data,
$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x^{(n)}; \theta)$$
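66 Learning Deep Models
$$L(\theta) = \sum_{n=1}^{N} \log p(x^{(n)}; \theta)$$
[Figure: graphical model with observed $x$ in a plate of size $N$ and parameters $\theta$]
- Dominant framework is gradient-based optimization:
$$\theta^{(i)} = \theta^{(i-1)} + \eta \nabla_{\theta} L(\theta)$$
- $\nabla_{\theta} L(\theta)$ calculated with backpropagation.
- Tactics: mini-batch based training, adaptive learning rates [Duchi et al. 2011; Kingma and Ba 2015].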
67 Learning Deep Latent-Variable Models: Marginalization
Likelihood requires summing out the latent variables,
$$p(x; \theta) = \sum_{z \in \mathcal{Z}} p(x, z; \theta) \quad \left(= \int p(x, z; \theta)\, dz \text{ if continuous } z\right)$$
In general, hard to optimize log-likelihood for the training set,
$$L(\theta) = \sum_{n=1}^{N} \log \sum_{z \in \mathcal{Z}} p(x^{(n)}, z; \theta)$$
[Figure: graphical model over $\theta$, $z^{(n)}$, $x^{(n)}$, with a plate of size $N$]
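70 Variational Objective
High-level: decompose objective into lower-bound and gap.
[Figure: $L(\theta)$ split into $\mathrm{LB}(\theta, \lambda)$ and $\mathrm{GAP}(\theta, \lambda)$]
$$L(\theta) = \mathrm{LB}(\theta, \lambda) + \mathrm{GAP}(\theta, \lambda) \quad \text{for some } \lambda$$
Provides framework for deriving a rich set of optimization algorithms.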
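71 Marginal Likelihood: Variational Decomposition
For any¹ distribution $q(z \mid x; \lambda)$ over $z$,
$$L(\theta) = \underbrace{\mathbb{E}_q\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right]}_{\text{ELBO (evidence lower bound)}} + \underbrace{\mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]}_{\text{posterior gap}}$$
Since KL is always non-negative, $L(\theta) \geq \mathrm{ELBO}(\theta, \lambda)$.
¹ Technical condition: $\mathrm{supp}(q(z)) \subseteq \mathrm{supp}(p(z \mid x; \theta))$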
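The decomposition can be checked numerically on a small discrete model where everything is enumerable. The sketch below uses the mixture of categoricals with hypothetical toy parameters and an arbitrary $q$ with full support; the function names are illustrative only.

```python
import math

def joint(x, z, mu, pi):
    """p(x, z) = mu_z * prod_t pi_{z, x_t}"""
    p = mu[z]
    for x_t in x:
        p *= pi[z][x_t]
    return p

def elbo_and_gap(x, q, mu, pi):
    """Return (ELBO, KL[q || p(z|x)], log p(x)) by enumerating z. Requires q[k] > 0."""
    K = len(mu)
    p_xz = [joint(x, k, mu, pi) for k in range(K)]
    p_x = sum(p_xz)
    post = [p / p_x for p in p_xz]
    elbo = sum(q[k] * math.log(p_xz[k] / q[k]) for k in range(K))
    kl = sum(q[k] * math.log(q[k] / post[k]) for k in range(K))
    return elbo, kl, math.log(p_x)

# Toy parameters (assumed for illustration): K = 2 clusters, V = 3 word types.
mu = [0.5, 0.5]
pi = [[0.8, 0.1, 0.1],
      [0.1, 0.1, 0.8]]
x = [0, 0, 2]

# An arbitrary q: the bound holds but is loose.
elbo, kl, log_px = elbo_and_gap(x, [0.3, 0.7], mu, pi)

# q equal to the exact posterior: the gap closes and the ELBO equals log p(x).
exact_post = [0.032 / 0.036, 0.004 / 0.036]
elbo_tight, kl_tight, _ = elbo_and_gap(x, exact_post, mu, pi)
```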
72 Evidence Lower Bound: Proof
$$
\begin{aligned}
\log p(x; \theta) &= \mathbb{E}_q \log p(x) && \text{(Expectation over } z\text{)} \\
&= \mathbb{E}_q \log \frac{p(x, z)}{p(z \mid x)} && \text{(Mult/div by } p(z \mid x)\text{, combine numerator)} \\
&= \mathbb{E}_q \log \left( \frac{p(x, z)}{q(z \mid x)} \cdot \frac{q(z \mid x)}{p(z \mid x)} \right) && \text{(Mult/div by } q(z \mid x)\text{)} \\
&= \mathbb{E}_q \log \frac{p(x, z)}{q(z \mid x)} + \mathbb{E}_q \log \frac{q(z \mid x)}{p(z \mid x)} && \text{(Split log)} \\
&= \mathbb{E}_q \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} + \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]
\end{aligned}
$$
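77 Evidence Lower Bound over Observations
$$\mathrm{ELBO}(\theta, \lambda; x) = \mathbb{E}_{q(z \mid x; \lambda)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right]$$
ELBO is a function of the generative model parameters, $\theta$, and the variational parameters, $\lambda$.
$$
\begin{aligned}
\sum_{n=1}^{N} \log p(x^{(n)}; \theta) &\geq \sum_{n=1}^{N} \mathrm{ELBO}(\theta, \lambda; x^{(n)}) \\
&= \sum_{n=1}^{N} \mathbb{E}_{q(z \mid x^{(n)}; \lambda)}\left[\log \frac{p(x^{(n)}, z; \theta)}{q(z \mid x^{(n)}; \lambda)}\right] \\
&= \mathrm{ELBO}(\theta, \lambda; x^{(1:N)}) = \mathrm{ELBO}(\theta, \lambda)
\end{aligned}
$$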
78 Setup: Selecting Variational Family
Just as with $p$ and $\theta$, we can select any form of $q$ and $\lambda$ that satisfies ELBO conditions. Different choices of $q$ will lead to different algorithms. We will explore several forms of $q$:
- Posterior
- Point Estimate / MAP
- Amortized
- Mean Field (later)
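79 Example Family: Full Posterior Form
[Figure: variational model with local $\lambda^{(n)}$ and $z^{(n)}$ next to the generative model over $z^{(n)}$, $x^{(n)}$, $\theta$, each in a plate of size $N$, related by $\mathrm{KL}(q(z \mid x) \,\|\, p(z \mid x))$]
$\lambda = [\lambda^{(1)}, \ldots, \lambda^{(N)}]$ is a concatenation of local variational parameters $\lambda^{(n)}$, e.g.
$$q(z^{(n)} \mid x^{(n)}; \lambda) = q(z^{(n)} \mid x^{(n)}; \lambda^{(n)}) = \mathcal{N}(\lambda^{(n)}, 1)$$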
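80 Example Family: Amortized Parameterization [Kingma and Welling 2014]
[Figure: variational model with global $\lambda$, $x^{(n)}$, and $z^{(n)}$ next to the generative model over $z^{(n)}$, $x^{(n)}$, $\theta$, each in a plate of size $N$, related by $\mathrm{KL}(q(z \mid x) \,\|\, p(z \mid x))$]
$\lambda$ parameterizes a global network (encoder/inference network) that is run over $x^{(n)}$ to produce the local variational distribution, e.g.
$$q(z^{(n)} \mid x^{(n)}; \lambda) = \mathcal{N}(\mu^{(n)}, 1), \qquad \mu^{(n)} = \mathrm{enc}(x^{(n)}; \lambda)$$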
81 1 Exact Gradient Sampling Conjugacy Exact Gradient Sampling Conjugacy /153
82 Maximizing the Evidence Lower Bound
Central quantity of interest: almost all methods are maximizing the ELBO,
$$\arg\max_{\theta, \lambda} \mathrm{ELBO}(\theta, \lambda)$$
Aggregate ELBO objective,
$$
\begin{aligned}
\arg\max_{\theta, \lambda} \mathrm{ELBO}(\theta, \lambda) &= \arg\max_{\theta, \lambda} \sum_{n=1}^{N} \mathrm{ELBO}(\theta, \lambda; x^{(n)}) \\
&= \arg\max_{\theta, \lambda} \sum_{n=1}^{N} \mathbb{E}_q\left[\log \frac{p(x^{(n)}, z^{(n)}; \theta)}{q(z^{(n)} \mid x^{(n)}; \lambda)}\right]
\end{aligned}
$$
84 Maximizing ELBO: Model Parameters
$$\arg\max_{\theta} \mathbb{E}_q\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right] = \arg\max_{\theta} \mathbb{E}_q[\log p(x, z; \theta)]$$
Intuition: maximum likelihood problem under variables drawn from $q(z \mid x; \lambda)$.
85 Model Estimation: Gradient Ascent on Model Parameters
Easy: gradient with respect to $\theta$,
$$\nabla_{\theta} \mathrm{ELBO}(\theta, \lambda; x) = \nabla_{\theta} \mathbb{E}_q[\log p(x, z; \theta)] = \mathbb{E}_q[\nabla_{\theta} \log p(x, z; \theta)]$$
- Since $q$ is not dependent on $\theta$, $\nabla_{\theta}$ moves inside the expectation.
- Estimate with samples from $q$. Term $\log p(x, z; \theta)$ is easy to evaluate. (In practice a single sample is often sufficient.)
- In special cases, can exactly evaluate expectation.
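87 Maximizing ELBO: Variational Distribution
$$
\begin{aligned}
\arg\max_{\lambda} \mathrm{ELBO}(\theta, \lambda) &= \arg\max_{\lambda} \; \log p(x; \theta) - \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)] \\
&= \arg\min_{\lambda} \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]
\end{aligned}
$$
[Figure: $L$ split into ELBO and posterior gap, as a function of $\lambda$]
Intuition: $q$ should approximate the posterior $p(z \mid x)$. However, may be difficult if $q$ or $p$ is a deep model.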
88 Model Inference: Gradient Ascent on λ?
Hard: gradient with respect to $\lambda$,
$$\nabla_{\lambda} \mathrm{ELBO}(\theta, \lambda; x) = \nabla_{\lambda} \mathbb{E}_q\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right] \neq \mathbb{E}_q\left[\nabla_{\lambda} \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right]$$
Cannot naively move $\nabla_{\lambda}$ inside the expectation, since $q$ depends on $\lambda$.
This section: inference in practice:
1. Exact gradient
2. Sampling: score function, reparameterization
3. Conjugacy: closed-form, coordinate ascent
91 Strategy 1: Exact Gradient
$$\nabla_{\lambda} \mathrm{ELBO}(\theta, \lambda; x) = \nabla_{\lambda} \mathbb{E}_{q(z \mid x; \lambda)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right] = \nabla_{\lambda} \left( \sum_{z \in \mathcal{Z}} q(z \mid x; \lambda) \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \right)$$
- Naive enumeration: linear in $|\mathcal{Z}|$.
- Depending on structure of $q$ and $p$, potentially faster with dynamic programming.
- Applicable mainly to Model 1 and 3 (Discrete and Structured), or Model 2 with point estimate.
93 Example: Model 1 - Naive Bayes
[Figure: inference network with $\lambda$, $x$, $z$ next to the generative model with $z$ and $x_1, \ldots, x_T$]
Let $q(z \mid x; \lambda) = \mathrm{Cat}(\nu)$ where $\nu = \mathrm{enc}(x; \lambda)$.
$$
\begin{aligned}
\nabla_{\lambda} \mathrm{ELBO}(\theta, \lambda; x) &= \nabla_{\lambda} \mathbb{E}_{q(z \mid x; \lambda)}\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right] \\
&= \nabla_{\lambda} \left( \sum_{z \in \mathcal{Z}} q(z \mid x; \lambda) \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \right) \\
&= \nabla_{\lambda} \left( \sum_{z \in \mathcal{Z}} \nu_z \log \frac{p(x, z; \theta)}{\nu_z} \right)
\end{aligned}
$$
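A minimal sketch of the exact-gradient strategy for this example follows. To keep it self-contained it replaces the encoder with free logits, i.e. it assumes $\nu = \mathrm{softmax}(\lambda)$ rather than $\nu = \mathrm{enc}(x; \lambda)$; under that assumption the gradient has the closed form $\partial\,\mathrm{ELBO} / \partial \lambda_i = \nu_i (r_i - \mathrm{ELBO})$ with $r_k = \log p(x, z{=}k; \theta) - \log \nu_k$, which the code checks against central finite differences. The values in `log_p_xz` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def elbo(lam, log_p_xz):
    """ELBO = sum_z nu_z * (log p(x, z) - log nu_z), enumerating all z."""
    nu = softmax(lam)
    return sum(n * (lp - math.log(n)) for n, lp in zip(nu, log_p_xz))

def elbo_grad(lam, log_p_xz):
    """Closed-form gradient w.r.t. the logits: nu_i * (r_i - ELBO)."""
    nu = softmax(lam)
    value = elbo(lam, log_p_xz)
    return [n * ((lp - math.log(n)) - value) for n, lp in zip(nu, log_p_xz)]

def finite_diff_grad(lam, log_p_xz, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(lam)):
        up = lam[:i] + [lam[i] + eps] + lam[i + 1:]
        down = lam[:i] + [lam[i] - eps] + lam[i + 1:]
        grads.append((elbo(up, log_p_xz) - elbo(down, log_p_xz)) / (2 * eps))
    return grads

# log p(x, z = k) for K = 2 clusters (assumed toy values).
log_p_xz = [math.log(0.032), math.log(0.004)]
lam = [0.2, -0.1]

g_exact = elbo_grad(lam, log_p_xz)
g_numeric = finite_diff_grad(lam, log_p_xz)
```

The cost of each evaluation is linear in the number of latent values, which is what makes this strategy practical only when $\mathcal{Z}$ is small or structured.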
96 Strategy 2: Sampling
$$
\begin{aligned}
\nabla_{\lambda} \mathrm{ELBO}(\theta, \lambda; x) &= \nabla_{\lambda} \mathbb{E}_q\left[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\right] \\
&= \nabla_{\lambda} \mathbb{E}_q[\log p(x, z; \theta)] - \nabla_{\lambda} \mathbb{E}_q[\log q(z \mid x; \lambda)]
\end{aligned}
$$
- How can we approximate this gradient with sampling? Naive algorithm fails to provide non-zero gradient: once the samples are drawn, they no longer depend on $\lambda$,
$$z^{(1)}, \ldots, z^{(J)} \sim q(z \mid x; \lambda), \qquad \nabla_{\lambda} \frac{1}{J} \sum_{j=1}^{J} \left[\log p(x, z^{(j)}; \theta)\right] = 0$$
- Manipulate expression so we can move $\nabla_{\lambda}$ inside $\mathbb{E}_q$ before sampling.
98 Strategy 2a: Sampling - Score Function Gradient Estimator
First term. Use basic identity:
$$\nabla \log q = \frac{\nabla q}{q} \;\Rightarrow\; \nabla q = q \nabla \log q$$
Policy-gradient style training [Williams 1992]
$$
\begin{aligned}
\nabla_{\lambda} \mathbb{E}_q[\log p(x, z; \theta)] &= \sum_z \nabla_{\lambda} q(z \mid x; \lambda) \log p(x, z; \theta) \\
&= \sum_z q(z \mid x; \lambda)\, \nabla_{\lambda} \log q(z \mid x; \lambda) \log p(x, z; \theta) \\
&= \mathbb{E}_q\left[\log p(x, z; \theta)\, \nabla_{\lambda} \log q(z \mid x; \lambda)\right]
\end{aligned}
$$
102 Strategy 2a: Sampling - Score Function Gradient Estimator
Second term. Need additional identity:
$$\sum_z \nabla q = \nabla \sum_z q = \nabla 1 = 0$$
$$
\begin{aligned}
\nabla_{\lambda} \mathbb{E}_q[\log q(z \mid x; \lambda)] &= \sum_z \nabla_{\lambda} \left( q(z \mid x; \lambda) \log q(z \mid x; \lambda) \right) \\
&= \sum_z \left( \nabla_{\lambda} q(z \mid x; \lambda) \log q(z \mid x; \lambda) + q(z \mid x; \lambda)\, \nabla_{\lambda} \log q(z \mid x; \lambda) \right) \\
&= \sum_z \log q(z \mid x; \lambda)\, q(z \mid x; \lambda)\, \nabla_{\lambda} \log q(z \mid x; \lambda) + \underbrace{\sum_z \nabla_{\lambda} q(z \mid x; \lambda)}_{= \nabla \sum q = \nabla 1 = 0} \\
&= \mathbb{E}_q\left[\log q(z \mid x; \lambda)\, \nabla_{\lambda} \log q(z \mid x; \lambda)\right]
\end{aligned}
$$
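107 Strategy 2a: Sampling Score Function Gradient Estimator
Putting these together,
\[
\begin{aligned}
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x)
&= \nabla_\lambda \mathbb{E}_q\Big[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\Big] \\
&= \mathbb{E}_q\Big[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \nabla_\lambda \log q(z \mid x; \lambda)\Big] \\
&= \mathbb{E}_q[R_{\theta,\lambda}(z) \nabla_\lambda \log q(z \mid x; \lambda)]
\end{aligned}
\]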
108 Strategy 2a: Sampling Score Function Gradient Estimator
Estimate with samples $z^{(1)}, \dots, z^{(J)} \sim q(z \mid x; \lambda)$:
\[
\mathbb{E}_q[R_{\theta,\lambda}(z) \nabla_\lambda \log q(z \mid x; \lambda)] \approx \frac{1}{J} \sum_{j=1}^J R_{\theta,\lambda}(z^{(j)}) \nabla_\lambda \log q(z^{(j)} \mid x; \lambda)
\]
Intuition: if a sample $z^{(j)}$ has high reward $R_{\theta,\lambda}(z^{(j)})$, increase the probability of $z^{(j)}$ by moving along the gradient $\nabla_\lambda \log q(z^{(j)} \mid x; \lambda)$.
109 Strategy 2a: Sampling Score Function Gradient Estimator
- Essentially reinforcement learning with reward $R_{\theta,\lambda}(z)$.
- The score function gradient is generally applicable regardless of what distribution q takes (only need to evaluate $\nabla_\lambda \log q$).
- This generality comes at a cost, since the reward is "black-box": the estimator is unbiased but has high variance.
- In practice, need a variance-reducing control variate B (more on this later).
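The estimator can be checked numerically on a small categorical q. The sketch below is illustrative only: it assumes q is a softmax over three hypothetical logits and uses a fixed reward table R(z) that does not depend on λ, so the exact gradient can be enumerated and compared with the sampled estimate.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_q(z, probs):
    # d/d logit_k of log softmax(logits)[z] = 1[k == z] - probs[k]
    return [(1.0 if k == z else 0.0) - p for k, p in enumerate(probs)]

def exact_gradient(logits, reward):
    # Enumerate every z: sum_z q(z) R(z) grad log q(z). Costs O(|Z|).
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for z, qz in enumerate(probs):
        g = grad_log_q(z, probs)
        for k in range(len(logits)):
            grad[k] += qz * reward[z] * g[k]
    return grad

def score_function_gradient(logits, reward, num_samples, rng):
    # Monte Carlo version: average R(z) grad log q(z) over samples z ~ q. Costs O(J).
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        z = rng.choices(range(len(probs)), weights=probs)[0]
        g = grad_log_q(z, probs)
        for k in range(len(logits)):
            grad[k] += reward[z] * g[k] / num_samples
    return grad

# Hypothetical toy values, not taken from the slides.
logits = [0.5, -0.2, 0.1]
reward = [1.0, 2.0, 3.0]
exact = exact_gradient(logits, reward)
estimate = score_function_gradient(logits, reward, 20000, random.Random(0))
print(exact)
print(estimate)
```

With many samples the two vectors should roughly agree; with few samples the spread of the estimate illustrates the high variance noted above.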
110 Example: Model 1 - Naive Bayes
Let $q(z \mid x; \lambda) = \mathrm{Cat}(\nu)$ where $\nu = \mathrm{enc}(x; \lambda)$.
Sample $z^{(1)}, \dots, z^{(J)} \sim q(z \mid x; \lambda)$.
\[
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x)
= \mathbb{E}_q\Big[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \nabla_\lambda \log q(z \mid x; \lambda)\Big]
\approx \frac{1}{J} \sum_{j=1}^J \log \frac{p(x, z^{(j)}; \theta)}{\nu_{z^{(j)}}} \nabla_\lambda \log \nu_{z^{(j)}}
\]
Computational complexity: $O(J)$ vs $O(|\mathcal{Z}|)$.
112 Strategy 2b: Sampling Reparameterization
Suppose we can sample from q by applying a deterministic, differentiable transformation g to a base noise density:
\[
\epsilon \sim \mathcal{U}, \qquad z = g(\epsilon, \lambda)
\]
Gradient calculation (first term):
\[
\begin{aligned}
\nabla_\lambda \mathbb{E}_{z \sim q(z \mid x; \lambda)}[\log p(x, z; \theta)]
&= \nabla_\lambda \mathbb{E}_{\epsilon \sim \mathcal{U}}[\log p(x, g(\epsilon, \lambda); \theta)] \\
&= \mathbb{E}_{\epsilon \sim \mathcal{U}}[\nabla_\lambda \log p(x, g(\epsilon, \lambda); \theta)] \\
&\approx \frac{1}{J} \sum_{j=1}^J \nabla_\lambda \log p(x, g(\epsilon^{(j)}, \lambda); \theta)
\end{aligned}
\]
where $\epsilon^{(1)}, \dots, \epsilon^{(J)} \sim \mathcal{U}$.
114 Strategy 2b: Sampling Reparameterization
- Unbiased, like the score function gradient estimator, but empirically lower variance.
- In practice, a single sample is often sufficient.
- Cannot be used out-of-the-box for discrete z.
115 Strategy 2: Continuous Latent Variable RNN
Choose the variational family to be an amortized diagonal Gaussian:
\[
q(z \mid x; \lambda) = \mathcal{N}(\mu, \sigma^2), \qquad \mu, \sigma^2 = \mathrm{enc}(x; \lambda)
\]
Then we can sample from $q(z \mid x; \lambda)$ by
\[
\epsilon \sim \mathcal{N}(0, I), \qquad z = \mu + \sigma \epsilon
\]
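116 Strategy 2b: Sampling Reparameterization
(Recall $R_{\theta,\lambda}(z) = \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}$.)
Score function:
\[
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x) = \mathbb{E}_{z \sim q}[R_{\theta,\lambda}(z) \nabla_\lambda \log q(z \mid x; \lambda)]
\]
Reparameterization:
\[
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[\nabla_\lambda R_{\theta,\lambda}(g(\epsilon, \lambda; x))]
\]
where $g(\epsilon, \lambda; x) = \mu + \sigma \epsilon$.
Informally, reparameterization gradients differentiate through $R_{\theta,\lambda}(\cdot)$ and thus have more knowledge about the structure of the objective function.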
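The difference between the two estimators can be seen on a one-dimensional toy problem. In the sketch below the objective $f(z) = z^2$ is a stand-in for $R_{\theta,\lambda}$ (an assumption made so the exact answer is known: $\frac{d}{d\mu} \mathbb{E}[z^2] = 2\mu$), and both estimators share the same noise draws.

```python
import random

def gradient_samples(mu, sigma, num_samples, rng):
    """Per-sample estimates of d/dmu E_{z ~ N(mu, sigma^2)}[z^2]."""
    score, reparam = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps  # z = g(eps, lambda)
        f = z * z
        # Score function: f(z) * d/dmu log N(z; mu, sigma^2) = f(z) * (z - mu) / sigma^2
        score.append(f * (z - mu) / sigma ** 2)
        # Reparameterization: d/dmu f(mu + sigma * eps) = 2 * z
        reparam.append(2.0 * z)
    return score, reparam

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mu, sigma = 1.0, 0.5  # hypothetical toy values
score, reparam = gradient_samples(mu, sigma, 50000, random.Random(0))
exact = 2.0 * mu      # because E[z^2] = mu^2 + sigma^2
print(exact, mean(score), mean(reparam))
print(variance(score), variance(reparam))
```

Both sample means should sit near the exact gradient, while the per-sample variance of the score function estimator is noticeably larger in this toy setting.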
118 Strategy 3: Conjugacy
For certain choices of p and q, we can compute parts of $\arg\max_\lambda \mathrm{ELBO}(\theta, \lambda; x)$ exactly in closed form.
Recall that
\[
\arg\max_\lambda \mathrm{ELBO}(\theta, \lambda; x) = \arg\min_\lambda \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]
\]
120 Strategy 3a: Conjugacy Tractable Posterior
Suppose we can tractably calculate $p(z \mid x; \theta)$. Then $\mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]$ is minimized when
\[
q(z \mid x; \lambda) = p(z \mid x; \theta)
\]
This is the E-step in the Expectation Maximization algorithm [Dempster et al. 1977].
[Figure: posterior gap between L and the ELBO as a function of λ.]
121 Example: Model 1 - Naive Bayes
\[
p(z \mid x; \theta) = \frac{p(x, z; \theta)}{\sum_{z'=1}^K p(x, z'; \theta)}
\]
So λ is given by the parameters of the categorical distribution, i.e.
\[
\lambda = [p(z = 1 \mid x; \theta), \dots, p(z = K \mid x; \theta)]
\]
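A minimal sketch of this exact posterior computation follows. It assumes a Naive Bayes joint of the form $p(x, z) = p(z) \prod_t p(x_t \mid z)$ with hypothetical toy parameters, and normalizes in log space for numerical stability.

```python
import math

def log_joint(x, z, log_prior, log_emit):
    # log p(x, z) = log p(z) + sum_t log p(x_t | z)
    return log_prior[z] + sum(log_emit[z][token] for token in x)

def posterior(x, log_prior, log_emit):
    # Normalize the K joint scores with log-sum-exp: p(z | x) = p(x, z) / sum_z' p(x, z')
    scores = [log_joint(x, z, log_prior, log_emit) for z in range(len(log_prior))]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))  # log p(x)
    return [math.exp(s - log_norm) for s in scores]

# Hypothetical toy parameters: K = 2 classes, vocabulary of 3 token ids.
log_prior = [math.log(0.5), math.log(0.5)]
log_emit = [
    [math.log(p) for p in (0.7, 0.2, 0.1)],
    [math.log(p) for p in (0.1, 0.2, 0.7)],
]
lam = posterior([0, 0, 1], log_prior, log_emit)
print(lam)
```

The returned list plays the role of λ on the slide: one posterior probability per class.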
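122 Example: Model 3 HMM
[Figure: HMM graphical model with latent states $z_1, \dots, z_4$, observations $x_1, \dots, x_4$, transition parameters μ, and emission parameters π.]
\[
p(x, z; \theta) = p(z_0) \prod_{t=1}^T p(z_t \mid z_{t-1}; \mu) \, p(x_t \mid z_t; \pi)
\]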
123 Example: Model 3 HMM
Run forward/backward dynamic programming to calculate the posterior marginals $p(z_t, z_{t+1} \mid x; \theta)$.
The variational parameters $\lambda \in \mathbb{R}^{T K^2}$ store the edge marginals. These are enough to calculate
\[
q(z; \lambda) = p(z \mid x; \theta)
\]
(i.e. the exact posterior) over any sequence z.
125 Connection: Gradient Ascent on Log Marginal Likelihood
Why not perform gradient ascent directly on the log marginal likelihood?
\[
\log p(x; \theta) = \log \sum_z p(x, z; \theta)
\]
This is the same as optimizing the ELBO with posterior inference (i.e. EM). The gradients of the model parameters are given by (where $q(z \mid x; \lambda) = p(z \mid x; \theta)$):
\[
\nabla_\theta \log p(x; \theta) = \mathbb{E}_{q(z \mid x; \lambda)}[\nabla_\theta \log p(x, z; \theta)]
\]
[Figure: posterior gap between L and the ELBO.]
127 Connection: Gradient Ascent on Log Marginal Likelihood
Practically, this means we don't have to manually perform posterior inference in the E-step. We can just calculate $\log p(x; \theta)$ and call backpropagation.
Example: in a deep HMM, just implement the forward algorithm to calculate $\log p(x; \theta)$ and backpropagate using autodiff. No need to implement the backward algorithm (or vice versa).
(See Eisner [2016]: "Inside-Outside and Forward-Backward Algorithms Are Just Backprop".)
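As a rough illustration of "just implement the forward algorithm", the sketch below computes $\log p(x; \theta)$ for a small HMM in log space and checks it against brute-force enumeration. It is written in plain Python, so it has no autodiff; in an autodiff framework the same recursion would be expressed with differentiable tensor operations. The parameter values are hypothetical, and the initial distribution is placed directly on the first emitting state, which is a simplifying assumption.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_marginal(x, log_init, log_trans, log_emit):
    """log p(x) for an HMM via the forward algorithm, O(T K^2)."""
    K = len(log_init)
    alpha = [log_init[k] + log_emit[k][x[0]] for k in range(K)]
    for token in x[1:]:
        alpha = [
            logsumexp([alpha[j] + log_trans[j][k] for j in range(K)]) + log_emit[k][token]
            for k in range(K)
        ]
    return logsumexp(alpha)

def brute_force_log_marginal(x, log_init, log_trans, log_emit):
    """Same quantity by summing over all K^T state sequences (for checking only)."""
    K = len(log_init)
    terms = []
    for z in itertools.product(range(K), repeat=len(x)):
        lp = log_init[z[0]] + log_emit[z[0]][x[0]]
        for t in range(1, len(x)):
            lp += log_trans[z[t - 1]][z[t]] + log_emit[z[t]][x[t]]
        terms.append(lp)
    return logsumexp(terms)

def to_log(rows):
    return [[math.log(v) for v in row] for row in rows]

# Hypothetical toy HMM: K = 2 states, vocabulary of 3 token ids.
log_init = [math.log(0.6), math.log(0.4)]
log_trans = to_log([[0.7, 0.3], [0.2, 0.8]])
log_emit = to_log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 2, 1, 2]
fwd = forward_log_marginal(x, log_init, log_trans, log_emit)
brute = brute_force_log_marginal(x, log_init, log_trans, log_emit)
print(fwd, brute)
```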
128 Strategy 3b: Conditional Conjugacy
Let $p(z \mid x; \theta)$ be intractable, but suppose $p(x, z; \theta)$ is conditionally conjugate, meaning $p(z_t \mid x, z_{-t}; \theta)$ is in the exponential family.
Restrict the family of distributions q so that it factorizes over $z_t$, i.e.
\[
q(z; \lambda) = \prod_{t=1}^T q(z_t; \lambda_t) \qquad \text{(mean field family)}
\]
Further choose $q(z_t; \lambda_t)$ so that it is in the same family as $p(z_t \mid x, z_{-t}; \theta)$.
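129 Strategy 3b: Conditional Conjugacy
[Figure: graphical models of the model (latents $z^{(n)}_1, \dots, z^{(n)}_T$, observation $x^{(n)}$, parameters θ) and of the mean-field approximation (one variational parameter $\lambda^{(n)}_t$ per latent), related through $\mathrm{KL}(q(z) \,\|\, p(z \mid x))$ with $q(z; \lambda) = \prod_{t=1}^T q(z_t; \lambda_t)$.]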
130 Mean Field Family
Optimize the ELBO via coordinate ascent, i.e. iterate for $\lambda_1, \dots, \lambda_T$:
\[
\arg\min_{\lambda_t} \mathrm{KL}\Big[\prod_{t=1}^T q(z_t; \lambda_t) \,\Big\|\, p(z \mid x; \theta)\Big]
\]
Coordinate ascent updates will take the form
\[
q(z_t; \lambda_t) \propto \exp\big( \mathbb{E}_{q(z_{-t}; \lambda_{-t})}[\log p(x, z; \theta)] \big)
\]
where
\[
\mathbb{E}_{q(z_{-t}; \lambda_{-t})}[\log p(x, z; \theta)] = \sum_{z_{-t}} \prod_{j \neq t} q(z_j; \lambda_j) \log p(x, z; \theta)
\]
Since $p(z_t \mid x, z_{-t})$ was assumed to be in the exponential family, the above updates can be derived in closed form.
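The coordinate ascent update can be traced on a tiny discrete model. The sketch below assumes two binary latent variables and a hypothetical 2x2 table of joint values $p(x, z_1, z_2)$ for one fixed x; each update sets one factor proportional to the exponentiated expected log joint under the other factor, and the ELBO is recorded after every sweep.

```python
import math

def normalize_exp(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elbo(log_p, q1, q2):
    # E_q[log p(x, z1, z2) - log q1(z1) - log q2(z2)] under q = q1 * q2
    total = 0.0
    for a, qa in enumerate(q1):
        for b, qb in enumerate(q2):
            total += qa * qb * (log_p[a][b] - math.log(qa) - math.log(qb))
    return total

def coordinate_ascent(log_p, num_sweeps=20):
    q1, q2 = [0.5, 0.5], [0.5, 0.5]
    history = [elbo(log_p, q1, q2)]
    for _ in range(num_sweeps):
        # q1(a) proportional to exp(E_{q2}[log p(x, a, z2)])
        q1 = normalize_exp([sum(q2[b] * log_p[a][b] for b in range(2)) for a in range(2)])
        # q2(b) proportional to exp(E_{q1}[log p(x, z1, b)])
        q2 = normalize_exp([sum(q1[a] * log_p[a][b] for a in range(2)) for b in range(2)])
        history.append(elbo(log_p, q1, q2))
    return q1, q2, history

joint = [[0.30, 0.05], [0.10, 0.15]]  # hypothetical p(x, z1, z2) for one fixed x
log_p = [[math.log(v) for v in row] for row in joint]
q1, q2, history = coordinate_ascent(log_p)
log_px = math.log(sum(sum(row) for row in joint))
print(history[0], history[-1], log_px)
```

The ELBO should never decrease across sweeps and stays below $\log p(x)$; the remaining gap reflects what the factorized family cannot represent.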
131 Example: Model 3 Factorial HMM
[Figure: factorial HMM with L = 3 chains of latent states $z_{l,1}, \dots, z_{l,4}$ over observations $x_1, \dots, x_4$.]
\[
p(x, z; \theta) = \prod_{l=1}^L \prod_{t=1}^T p(z_{l,t} \mid z_{l,t-1}; \theta) \, p(x_t \mid z_{l,t}; \theta)
\]
132 Example: Model 3 Factorial HMM
Mean-field updates, one latent variable at a time:
\[
q(z_{1,1}; \lambda_{1,1}) \propto \exp\big( \mathbb{E}_{q(z_{-(1,1)}; \lambda_{-(1,1)})}[\log p(x, z; \theta)] \big)
\]
\[
q(z_{2,1}; \lambda_{2,1}) \propto \exp\big( \mathbb{E}_{q(z_{-(2,1)}; \lambda_{-(2,1)})}[\log p(x, z; \theta)] \big)
\]
134 Example: Model 3 Factorial HMM
Exact inference:
- Naive: K states, L levels $\Rightarrow$ HMM with $K^L$ states $\Rightarrow O(T K^{2L})$
- Smarter: $O(T L K^{L+1})$
Mean field:
- Gaussian emissions: $O(T L K^2)$ [Ghahramani and Jordan 1996].
- Categorical emissions: need more variational approximations, but ultimately $O(L K V T)$ [Nepal and Yates 2013].
137 Gumbel-Softmax, Flows, IWAE
1. Gumbel-Softmax: Extend reparameterization to discrete variables.
2. Flows: Optimize a tighter bound by making the variational family q more flexible.
3. Importance Weighting: Optimize a tighter bound through importance sampling.
139 Challenges of Discrete Variables
Review: we can always use the score function estimator
\[
\begin{aligned}
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x)
&= \mathbb{E}_q\Big[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \nabla_\lambda \log q(z \mid x; \lambda)\Big] \\
&= \mathbb{E}_q\Big[\Big(\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} - B\Big) \nabla_\lambda \log q(z \mid x; \lambda)\Big]
\end{aligned}
\]
since $\mathbb{E}_q[B \nabla_\lambda \log q(z \mid x; \lambda)] = 0$ (because $\mathbb{E}[\nabla \log q] = \sum q \nabla \log q = \sum \nabla q = 0$).
Control variate B: not dependent on z, but can depend on x. Estimate this quantity with another neural net [Mnih and Gregor 2014] by minimizing
\[
\Big( B(x; \psi) - \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)} \Big)^2
\]
141 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
The Gumbel-Max trick [Papandreou and Yuille 2011]:
\[
p(z_k = 1; \alpha) = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}
\]
where $z = [0, 0, \dots, 1, \dots, 0]$ is a one-hot vector.
Can sample from $p(z; \alpha)$ by
1. Drawing independent Gumbel noise $\epsilon = \epsilon_1, \dots, \epsilon_K$:
\[
\epsilon_k = -\log(-\log u_k), \qquad u_k \sim \mathcal{U}(0, 1)
\]
2. Adding $\epsilon_k$ to $\log \alpha_k$ and finding the argmax, i.e.
\[
i = \arg\max_k [\log \alpha_k + \epsilon_k], \qquad z_i = 1
\]
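The two sampling steps of the Gumbel-Max trick can be sketched as follows. The weights are hypothetical, and the empirical frequencies are compared with $\alpha_k / \sum_j \alpha_j$ as a sanity check.

```python
import math
import random

def sample_gumbel(rng):
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(alpha, rng):
    # Perturb each log-weight with independent Gumbel noise and take the argmax.
    scores = [math.log(a) + sample_gumbel(rng) for a in alpha]
    return max(range(len(alpha)), key=lambda k: scores[k])

alpha = [1.0, 2.0, 5.0]  # hypothetical unnormalized weights
rng = random.Random(0)
num_samples = 20000
counts = [0] * len(alpha)
for _ in range(num_samples):
    counts[gumbel_max_sample(alpha, rng)] += 1
freqs = [c / num_samples for c in counts]
target = [a / sum(alpha) for a in alpha]
print(freqs)
print(target)
```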
144 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
Reparameterization:
\[
z = \arg\max_{s \in \Delta^{K-1}} (\log \alpha + \epsilon)^\top s = g(\epsilon, \alpha)
\]
$z = g(\epsilon, \alpha)$ is a deterministic function applied to stochastic noise.
Let's try applying this (recalling $R_{\theta,\lambda}(z) = \log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}$):
\[
q(z_k = 1 \mid x; \lambda) = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}, \qquad \alpha = \mathrm{enc}(x; \lambda)
\]
\[
\nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[R_{\theta,\lambda}(z)]
= \nabla_\lambda \mathbb{E}_{\epsilon \sim \mathrm{Gumbel}}[R_{\theta,\lambda}(g(\epsilon, \alpha))]
= \mathbb{E}_{\epsilon \sim \mathrm{Gumbel}}[\nabla_\lambda R_{\theta,\lambda}(g(\epsilon, \alpha))]
\]
147 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
But this won't work, because the gradients are zero (almost everywhere):
\[
z = g(\epsilon, \alpha) = \arg\max_{s \in \Delta^{K-1}} (\log \alpha + \epsilon)^\top s \;\Longrightarrow\; \nabla_\lambda R_{\theta,\lambda}(z) = 0
\]
Gumbel-Softmax trick: replace the arg max with a softmax,
\[
z = \mathrm{softmax}\Big(\frac{\log \alpha + \epsilon}{\tau}\Big), \qquad
z_k = \frac{\exp((\log \alpha_k + \epsilon_k)/\tau)}{\sum_{j=1}^K \exp((\log \alpha_j + \epsilon_j)/\tau)}
\]
(where τ is a temperature term), so that
\[
\nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[R_{\theta,\lambda}(z)] \approx \mathbb{E}_{\epsilon \sim \mathrm{Gumbel}}\Big[\nabla_\lambda R_{\theta,\lambda}\Big(\mathrm{softmax}\Big(\frac{\log \alpha + \epsilon}{\tau}\Big)\Big)\Big]
\]
150 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
- Approaches a discrete distribution as $\tau \to 0$ (anneal τ during training).
- Reparameterizable by construction.
- Differentiable and has non-zero gradients.
[Figure from Maddison et al. [2017].]
151 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017]
- See Maddison et al. [2017] on whether we can use the original categorical densities $p(z)$, $q(z)$, or need to use the relaxed densities $p_{GS}(z)$, $q_{GS}(z)$.
- Requires that $p(x \mid z; \theta)$ makes sense for non-discrete z (e.g. attention).
- Lower-variance, but biased gradient estimator. Variance grows as $\tau \to 0$.
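The effect of the temperature can be seen by applying the relaxation to one fixed noise vector. In the sketch below the weights and the noise values are hard-coded stand-ins (not actual Gumbel draws), so the output is deterministic; the point is only that the relaxed sample stays on the simplex and approaches the Gumbel-Max one-hot vector as τ shrinks.

```python
import math

def gumbel_softmax(log_alpha, eps, tau):
    """Relaxed sample: softmax((log_alpha + eps) / tau)."""
    scores = [(la + e) / tau for la, e in zip(log_alpha, eps)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

log_alpha = [math.log(a) for a in (1.0, 2.0, 5.0)]  # hypothetical weights
eps = [0.3, 1.2, -0.4]  # stand-in noise values, fixed for illustration

# Hard (Gumbel-Max) choice under the same noise.
hard_index = max(range(3), key=lambda k: log_alpha[k] + eps[k])

relaxed_warm = gumbel_softmax(log_alpha, eps, tau=1.0)
relaxed_cold = gumbel_softmax(log_alpha, eps, tau=0.05)
print(hard_index, relaxed_warm, relaxed_cold)
```

At τ = 1 the mass is spread over several entries; at τ = 0.05 nearly all of it sits on the arg max index.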
153 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
Recall
\[
\log p(x; \theta) = \mathrm{ELBO}(\theta, \lambda; x) + \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z \mid x; \theta)]
\]
The bound is tight when the variational posterior equals the true posterior:
\[
q(z \mid x; \lambda) = p(z \mid x; \theta) \;\Longrightarrow\; \log p(x; \theta) = \mathrm{ELBO}(\theta, \lambda; x)
\]
We want to make $q(z \mid x; \lambda)$ as flexible as possible: can we do better than just a Gaussian?
155 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
Idea: transform a sample from a simple initial variational distribution,
\[
z_0 \sim q(z \mid x; \lambda) = \mathcal{N}(\mu, \sigma^2), \qquad \mu, \sigma^2 = \mathrm{enc}(x; \lambda)
\]
into a more complex one,
\[
z_K = f_K \circ \cdots \circ f_2 \circ f_1(z_0; \lambda)
\]
where the $f_i(z_{i-1}; \lambda)$ are invertible transformations (whose parameters are absorbed by λ).
157 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
A sample from the final variational posterior is given by $z_K$. Its density is given by the change of variables formula:
\[
\begin{aligned}
\log q_K(z_K \mid x; \lambda)
&= \log q(z_0 \mid x; \lambda) + \sum_{k=1}^K \log \Big| \frac{\partial f_k^{-1}}{\partial z_k} \Big| \\
&= \underbrace{\log q(z_0 \mid x; \lambda)}_{\text{log density of Gaussian}} - \sum_{k=1}^K \underbrace{\log \Big| \frac{\partial f_k}{\partial z_{k-1}} \Big|}_{\text{log determinant of Jacobian}}
\end{aligned}
\]
The determinant calculation is $O(N^3)$ in general, but can be made faster depending on the parameterization of $f_k$.
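158 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
Can still use reparameterization to obtain gradients. Letting $F(z) = f_K \circ \cdots \circ f_1(z)$,
\[
\begin{aligned}
\nabla_\lambda \mathrm{ELBO}(\theta, \lambda; x)
&= \nabla_\lambda \mathbb{E}_{q_K(z_K \mid x; \lambda)}\Big[\log \frac{p(x, z; \theta)}{q_K(z_K \mid x; \lambda)}\Big] \\
&= \nabla_\lambda \mathbb{E}_{q(z_0 \mid x; \lambda)}\Big[\log \frac{p(x, F(z_0); \theta)}{q(z_0 \mid x; \lambda)} + \log \Big| \frac{\partial F}{\partial z_0} \Big|\Big] \\
&= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[\nabla_\lambda \Big(\log \frac{p(x, F(z_0); \theta)}{q(z_0 \mid x; \lambda)} + \log \Big| \frac{\partial F}{\partial z_0} \Big|\Big)\Big]
\end{aligned}
\]
The sign of the log-determinant term follows from the change of variables formula, $\log q_K(z_K \mid x; \lambda) = \log q(z_0 \mid x; \lambda) - \log |\partial F / \partial z_0|$.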
159 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
Examples of $f_k(z_{k-1}; \lambda)$:
Normalizing Flows [Rezende and Mohamed 2015]:
\[
f_k(z_{k-1}) = z_{k-1} + u_k h(w_k^\top z_{k-1} + b_k)
\]
Inverse Autoregressive Flows [Kingma et al. 2016]:
\[
f_k(z_{k-1}) = z_{k-1} \odot \sigma_k + \mu_k, \qquad
\sigma_{k,d} = \mathrm{sigmoid}(\mathrm{NN}(z_{k-1,<d})), \qquad
\mu_{k,d} = \mathrm{NN}(z_{k-1,<d})
\]
(In this case the Jacobian is triangular, so the determinant is just the product of the diagonal entries.)
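For the first transformation above, the log-determinant has a closed form via the matrix determinant lemma: $\log|\det \partial f / \partial z| = \log|1 + u^\top \psi(z)|$ with $\psi(z) = h'(w^\top z + b)\, w$. The sketch below assumes $h = \tanh$ and hypothetical two-dimensional parameters, checks the closed form against a finite-difference Jacobian, and applies one step of the change of variables formula to a standard normal base density.

```python
import math

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w . z + b) and log|det df/dz| = log|1 + u . psi(z)|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    out = [zi + ui * h for zi, ui in zip(z, u)]
    psi = [(1.0 - h * h) * wi for wi in w]  # h'(a) * w
    log_det = math.log(abs(1.0 + sum(ui * pi for ui, pi in zip(u, psi))))
    return out, log_det

def numerical_log_det_2d(z, u, w, b, step=1e-6):
    """Finite-difference Jacobian determinant, for checking the 2-D case only."""
    cols = []  # cols[j][i] approximates d f_i / d z_j
    for j in range(2):
        plus, minus = list(z), list(z)
        plus[j] += step
        minus[j] -= step
        f_plus, _ = planar_flow(plus, u, w, b)
        f_minus, _ = planar_flow(minus, u, w, b)
        cols.append([(p - m) / (2 * step) for p, m in zip(f_plus, f_minus)])
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

def log_standard_normal(z):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * zi * zi for zi in z)

# Hypothetical toy values.
z0 = [0.4, -1.1]
u, w, b = [0.8, -0.3], [0.5, 1.5], 0.2
z1, log_det = planar_flow(z0, u, w, b)
numeric = numerical_log_det_2d(z0, u, w, b)
log_q1 = log_standard_normal(z0) - log_det  # log q_1(z_1) = log q_0(z_0) - log|det J|
print(log_det, numeric, log_q1)
```

The closed form costs O(N) per layer, which is the kind of speed-up over a general O(N^3) determinant that the previous slide alludes to.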
160 Flows [Rezende and Mohamed 2015; Kingma et al. 2016]
[Figure from Rezende and Mohamed [2015].]
162 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]
Flows are a way of tightening the ELBO by making the variational family more flexible.
Not the only way: can obtain a tighter lower bound on $\log p(x; \theta)$ by using multiple importance samples.
Consider:
\[
I_K = \frac{1}{K} \sum_{k=1}^K \frac{p(x, z^{(k)}; \theta)}{q(z^{(k)} \mid x; \lambda)}, \qquad \text{where } z^{(1:K)} \sim \prod_{k=1}^K q(z^{(k)} \mid x; \lambda).
\]
Note that $I_K$ is an unbiased estimator of $p(x; \theta)$:
\[
\mathbb{E}_{q(z^{(1:K)} \mid x; \lambda)}[I_K] = p(x; \theta).
\]
164 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]
Any unbiased estimator of $p(x; \theta)$ can be used to obtain a lower bound, using Jensen's inequality:
\[
\begin{aligned}
p(x; \theta) = \mathbb{E}_{q(z^{(1:K)} \mid x; \lambda)}[I_K]
\;\Longrightarrow\; \log p(x; \theta) &\geq \mathbb{E}_{q(z^{(1:K)} \mid x; \lambda)}[\log I_K] \\
&= \mathbb{E}_{q(z^{(1:K)} \mid x; \lambda)}\Big[\log \frac{1}{K} \sum_{k=1}^K \frac{p(x, z^{(k)}; \theta)}{q(z^{(k)} \mid x; \lambda)}\Big]
\end{aligned}
\]
However, can also show [Burda et al. 2015]:
- $\log p(x; \theta) \geq \mathbb{E}[\log I_K] \geq \mathbb{E}[\log I_{K-1}]$
- $\lim_{K \to \infty} \mathbb{E}[\log I_K] = \log p(x; \theta)$ under mild conditions
165 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015]
\[
\mathbb{E}_{q(z^{(1:K)} \mid x; \lambda)}\Big[\log \frac{1}{K} \sum_{k=1}^K \frac{p(x, z^{(k)}; \theta)}{q(z^{(k)} \mid x; \lambda)}\Big]
\]
- Note that with K = 1, we recover the ELBO.
- Can interpret $\frac{p(x, z^{(k)}; \theta)}{q(z^{(k)} \mid x; \lambda)}$ as importance weights.
- If $q(z \mid x; \lambda)$ is reparameterizable, we can use the reparameterization trick to optimize $\mathbb{E}[\log I_K]$ directly.
- Otherwise, need score function gradient estimators [Mnih and Rezende 2016].
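The tightening with K can be observed on a model whose marginal likelihood is known. The sketch below assumes a toy Gaussian model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, so that $x \sim \mathcal{N}(0, 2)$ exactly, and deliberately uses the prior as a poor proposal q so that the K = 1 bound (the ELBO) is visibly loose.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_mean_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(x, num_importance_samples, num_replicates, rng):
    """Monte Carlo estimate of E[log I_K] for the toy model described above."""
    total = 0.0
    for _ in range(num_replicates):
        log_weights = []
        for _ in range(num_importance_samples):
            z = rng.gauss(0.0, 1.0)  # z ~ q(z | x) = N(0, 1), a deliberately crude proposal
            log_p = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
            log_q = log_normal_pdf(z, 0.0, 1.0)
            log_weights.append(log_p - log_q)
        total += log_mean_exp(log_weights)
    return total / num_replicates

x = 1.0
rng = random.Random(0)
bound_1 = iwae_bound(x, 1, 2000, rng)    # K = 1 recovers the ELBO
bound_50 = iwae_bound(x, 50, 2000, rng)
log_px = log_normal_pdf(x, 0.0, 2.0)     # exact log marginal for this toy model
print(bound_1, bound_50, log_px)
```

The K = 50 estimate should land much closer to the exact log marginal than the K = 1 estimate, up to Monte Carlo noise.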
168 Sentence VAE Example [Bowman et al. 2016]
Generative Model (Model 2):
- Draw $z \sim \mathcal{N}(0, I)$
- Draw $x_t \mid z \sim \mathrm{CRNNLM}(\theta, z)$
Inference Model (Amortized): Deep Diagonal Gaussians,
\[
q(z \mid x; \lambda) = \mathcal{N}(\mu, \sigma^2), \qquad
h_T = \mathrm{RNN}(x; \psi), \qquad
\mu = W_1 h_T, \qquad
\sigma^2 = \exp(W_2 h_T), \qquad
\lambda = \{W_1, W_2, \psi\}
\]
169 Sentence VAE Example [Bowman et al. 2016]
[Figure from Bowman et al. [2016], shown with the graphical model over noise ε, variational parameters λ, latent z, and observations $x_1, \dots, x_T$.]
170 Issue 1: Posterior Collapse
\[
\mathrm{ELBO}(\theta, \lambda) = \mathbb{E}_{q(z \mid x; \lambda)}\Big[\log \frac{p(x, z; \theta)}{q(z \mid x; \lambda)}\Big]
= \underbrace{\mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)]}_{\text{Reconstruction likelihood}} - \underbrace{\mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z)]}_{\text{Regularizer}}
\]
[Table: columns Model, LL/ELBO, Reconstruction, KL; rows RNN LM, RNN VAE. On Yahoo Corpus from Yang et al. [2017]. Numeric values are missing from the source.]
171 Issue 1: Posterior Collapse
x and z become independent, and $p(x, z; \theta)$ reduces to a non-LV language model.
Chen et al. [2017]: If it's possible to model $p^\star(x)$ without making use of z, then the ELBO optimum is at:
\[
p^\star(x) = p(x \mid z; \theta) = p(x; \theta), \qquad q(z \mid x; \lambda) = p(z), \qquad \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z)] = 0
\]
172 Mitigating Posterior Collapse
Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or word dropout [Bowman et al. 2016].
[Table: columns Model, LL/ELBO, Reconstruction, KL; rows RNN LM, RNN VAE, + Word Drop, CNN VAE. On Yahoo Corpus from Yang et al. [2017]. Numeric values are missing from the source.]
173 Mitigating Posterior Collapse
Gradually anneal a multiplier on the KL term, i.e.
\[
\mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] - \beta \, \mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z)]
\]
β goes from 0 to 1 as training progresses.
[Figure from Bowman et al. [2016].]
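A minimal sketch of the annealed objective, under two assumptions that are not from the slides: a linear warm-up schedule for β (other shapes, such as a sigmoid, are also used in practice), and a diagonal Gaussian posterior with a standard normal prior so the KL term has its standard closed form. The function names are hypothetical.

```python
import math

def kl_weight(step, warmup_steps):
    """Linear schedule: beta rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def annealed_objective(recon_log_lik, mu, log_var, step, warmup_steps):
    # Reconstruction term minus the beta-weighted KL regularizer.
    beta = kl_weight(step, warmup_steps)
    return recon_log_lik - beta * gaussian_kl_to_standard_normal(mu, log_var)

# Hypothetical numbers: reconstruction log-likelihood of -10 and a 1-D posterior N(1, 1).
early = annealed_objective(-10.0, [1.0], [0.0], step=0, warmup_steps=1000)
middle = annealed_objective(-10.0, [1.0], [0.0], step=500, warmup_steps=1000)
late = annealed_objective(-10.0, [1.0], [0.0], step=5000, warmup_steps=1000)
print(early, middle, late)
```

Early in training the KL penalty is switched off, so the encoder can place information in z before the regularizer starts pulling q toward the prior.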
174 Mitigating Posterior Collapse
Other approaches:
- Use auxiliary losses (e.g. train z as part of a topic model) [Dieng et al. 2017; Wang et al. 2018]
- Use a von Mises-Fisher distribution with a fixed concentration parameter [Guu et al. 2017; Xu and Durrett 2018]
- Combine stochastic/amortized variational inference [Kim et al. 2018]
- Add skip connections [Dieng et al. 2018]
In practice, it is often necessary to combine various methods.
175 Issue 2: Evaluation
The ELBO always lower bounds $\log p(x; \theta)$, so we can calculate an upper bound on PPL efficiently.
When reporting the ELBO, should also separately report
\[
\mathrm{KL}[q(z \mid x; \lambda) \,\|\, p(z)]
\]
to give an indication of how much the latent variable is being used.
176 Issue 2: Evaluation
Can also evaluate $\log p(x; \theta)$ with importance sampling:
\[
p(x; \theta) = \mathbb{E}_{q(z \mid x; \lambda)}\Big[\frac{p(x \mid z; \theta) p(z)}{q(z \mid x; \lambda)}\Big]
\approx \frac{1}{K} \sum_{k=1}^K \frac{p(x \mid z^{(k)}; \theta) p(z^{(k)})}{q(z^{(k)} \mid x; \lambda)}
\]
So
\[
\log p(x; \theta) \approx \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \mid z^{(k)}; \theta) p(z^{(k)})}{q(z^{(k)} \mid x; \lambda)}
\]
177 Evaluation
Qualitative evaluation:
- Evaluate samples from the prior/variational posterior.
- Interpolation in latent space.
[Figure from Bowman et al. [2016].]
179 Encoder/Decoder [Sutskever et al. 2014; Cho et al. 2014]
Given: source information $s = s_1, \dots, s_M$.
Generative process:
- Draw $x_{1:T} \mid s \sim \mathrm{CRNNLM}(\theta, \mathrm{enc}(s))$.
180 Latent, Per-token Experts [Yang et al. 2018]
Generative process: for $t = 1, \dots, T$,
- Draw $z_t \mid x_{<t}, s \sim \mathrm{softmax}(U h_t)$.
- Draw $x_t \mid z_t, x_{<t}, s \sim \mathrm{softmax}(W \tanh(Q_{z_t} h_t); \theta)$.
[Figure: graphical model with per-token latents $z^{(n)}_1, \dots, z^{(n)}_T$ and observations $x^{(n)}_1, \dots, x^{(n)}_T$.]
If $U \in \mathbb{R}^{K \times d}$, we use K experts; this increases the flexibility of the per-token distribution.
181 Case-Study: Latent, Per-token Experts [Yang et al. 2018]
Learning: the $z_t$ are independent given $x_{<t}$, so we can marginalize at each time-step (Method 3: Conjugacy).
\[
\arg\max_\theta \log p(x \mid s; \theta) = \arg\max_\theta \log \prod_{t=1}^T \sum_{k=1}^K p(z_t = k \mid s, x_{<t}; \theta) \, p(x_t \mid z_t = k, x_{<t}, s; \theta).
\]
Test-time:
\[
\arg\max_{x_{1:T}} \prod_{t=1}^T \sum_{k=1}^K p(z_t = k \mid s, x_{<t}; \theta) \, p(x_t \mid z_t = k, x_{<t}, s; \theta).
\]
182 Case-Study: Latent, Per-token Experts [Yang et al. 2018]
PTB language modeling results (s is constant):
[Table: columns Model, PPL; rows Merity et al. [2018], Softmax-mixture [Yang et al. 2018]. Numeric values are missing from the source.]
Dialogue generation results (s is context):
[Table: columns Model, BLEU Prec, BLEU Rec; rows No mixture, Softmax-mixture [Yang et al. 2018]. Numeric values are missing from the source.]
183 Attention [Bahdanau et al. 2015]
Decoding with an attention mechanism:
\[
x_t \mid x_{<t}, s \sim \mathrm{softmax}\Big(W \Big[h_t, \sum_{m=1}^M \alpha_{t,m} \, \mathrm{enc}(s)_m\Big]\Big).
\]
184 Copy Attention [Gu et al. 2016; Gulcehre et al. 2016]
Copy attention models copying words directly from s.
Generative process: for $t = 1, \dots, T$,
- Set $\alpha_t$ to be the attention weights.
- Draw $z_t \mid x_{<t}, s \sim \mathrm{Bern}(\mathrm{MLP}([h_t, \mathrm{enc}(s)]))$.
- If $z_t = 0$: draw $x_t \mid z_t, x_{<t}, s \sim \mathrm{softmax}(W h_t)$.
- Else: draw $x_t \in \{s_1, \dots, s_M\} \mid z_t, x_{<t}, s \sim \mathrm{Cat}(\alpha_t)$.
185 Copy Attention
Learning: can maximize the log per-token marginal [Gu et al. 2016], as with per-token experts:
\[
\max_\theta \log p(x_1, \dots, x_T \mid s; \theta) = \max_\theta \log \prod_{t=1}^T \sum_{z \in \{0, 1\}} p(z_t = z \mid s, x_{<t}; \theta) \, p(x_t \mid z, x_{<t}, s; \theta).
\]
Test-time:
\[
\arg\max_{x_{1:T}} \prod_{t=1}^T \sum_{z \in \{0, 1\}} p(z_t = z \mid s, x_{<t}; \theta) \, p(x_t \mid z, x_{<t}, s; \theta).
\]
186 Attention as a Latent Variable [Deng et al. 2018]
Generative process: for $t = 1, \dots, T$,
- Set $\alpha_t$ to be the attention weights.
- Draw $z_t \mid x_{<t}, s \sim \mathrm{Cat}(\alpha_t)$.
- Draw $x_t \mid z_t, x_{<t}, s \sim \mathrm{softmax}(W [h_t, \mathrm{enc}(s_{z_t})]; \theta)$.
187 Attention as a Latent Variable [Deng et al. 2018]
Marginal likelihood under the latent attention model:
\[
p(x_{1:T} \mid s; \theta) = \prod_{t=1}^T \sum_{m=1}^M \alpha_{t,m} \, \mathrm{softmax}(W [h_t, \mathrm{enc}(s_m)]; \theta)_{x_t}.
\]
Standard attention likelihood:
\[
p(x_{1:T} \mid s; \theta) = \prod_{t=1}^T \mathrm{softmax}\Big(W \Big[h_t, \sum_{m=1}^M \alpha_{t,m} \, \mathrm{enc}(s_m)\Big]; \theta\Big)_{x_t}.
\]
188 Attention as a Latent Variable [Deng et al. 2018]
Learning Strategy #1: Maximize the log marginal via enumeration as above.
Learning Strategy #2: Maximize the ELBO with AVI:
\[
\max_{\lambda, \theta} \; \mathbb{E}_{q(z_t; \lambda)}[\log p(x_t \mid x_{<t}, z_t, s)] - \mathrm{KL}[q(z_t; \lambda) \,\|\, p(z_t \mid x_{<t}, s)].
\]
- $q(z_t \mid x; \lambda)$ approximates $p(z_t \mid x_{1:T}, s; \theta)$; implemented with a BLSTM.
- q isn't reparameterizable, so gradients are obtained using REINFORCE + baseline.
189 Attention as a Latent Variable [Deng et al. 2018]
Test-time: calculate $p(x_t \mid x_{<t}, s; \theta)$ by summing out $z_t$.
MT results on IWSLT-2014:
[Table: columns Model, PPL, BLEU; rows Standard Attn, Latent Attn (marginal), Latent Attn (ELBO). Numeric values are missing from the source.]
190 Encoder/Decoder with Structured Latent Variables
At least two EMNLP 2018 papers augment encoder/decoder text generation models with structured latent variables:
1. Lee et al. [2018] generate $x_{1:T}$ by iteratively refining sequences of words $z_{1:T}$.
2. Wiseman et al. [2018] generate $x_{1:T}$ conditioned on a latent template or plan $z_{1:S}$.
192 Summary as a Latent Variable [Miao and Blunsom 2016]
Generative process for a document $x = x_1, \dots, x_T$:
- Draw a latent summary $z_1, \dots, z_M \sim \mathrm{RNNLM}(\theta)$
- Draw $x_1, \dots, x_T \mid z_{1:M} \sim \mathrm{CRNNLM}(\theta, z)$
Posterior inference: $p(z_{1:M} \mid x_{1:T}; \theta) = p(\text{summary} \mid \text{document}; \theta)$.
194 Summary as a Latent Variable [Miao and Blunsom 2016]
Learning: maximize the ELBO with an amortized family:
\[
\max_{\lambda, \theta} \; \mathbb{E}_{q(z_{1:M}; \lambda)}[\log p(x_{1:T} \mid z_{1:M}; \theta)] - \mathrm{KL}[q(z_{1:M}; \lambda) \,\|\, p(z_{1:M}; \theta)]
\]
- $q(z_{1:M}; \lambda)$ approximates $p(z_{1:M} \mid x_{1:T}; \theta)$; also implemented with encoder/decoder RNNs.
- $q(z_{1:M}; \lambda)$ is not reparameterizable, so gradients use REINFORCE + baselines.
195 Summary as a Latent Variable [Miao and Blunsom 2016]
Semi-supervised training: can also use documents without corresponding summaries in training.
- Train $q(z_{1:M}; \lambda) \approx p(z_{1:M} \mid x_{1:T}; \theta)$ with labeled examples.
- Infer a summary z for an unlabeled document with q.
- Use the inferred z to improve the model $p(x_{1:T} \mid z_{1:M}; \theta)$.
- Allows for outperforming strictly supervised models!
196 Topic Models [Blei et al. 2003]
Generative process: for each document $x^{(n)} = x^{(n)}_1, \dots, x^{(n)}_T$,
- Draw a topic distribution $z^{(n)}_{\mathrm{top}} \sim \mathrm{Dir}(\alpha)$
- For $t = 1, \dots, T$:
  - Draw a topic $z^{(n)}_t \sim \mathrm{Cat}(z^{(n)}_{\mathrm{top}})$
  - Draw $x^{(n)}_t \sim \mathrm{Cat}(\beta_{z^{(n)}_t})$
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationNote Set 5: Hidden Markov Models
Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional
More informationVariational Autoencoder
Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationBayesian Semi-supervised Learning with Deep Generative Models
Bayesian Semi-supervised Learning with Deep Generative Models Jonathan Gordon Department of Engineering Cambridge University jg801@cam.ac.uk José Miguel Hernández-Lobato Department of Engineering Cambridge
More informationAuto-Encoding Variational Bayes
Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationVariational Autoencoders (VAEs)
September 26 & October 3, 2017 Section 1 Preliminaries Kullback-Leibler divergence KL divergence (continuous case) p(x) andq(x) are two density distributions. Then the KL-divergence is defined as Z KL(p
More informationDeep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016
Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is
More informationNatural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu
Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak
More informationLecture 15. Probabilistic Models on Graph
Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted
More informationBackpropagation Through
Backpropagation Through Backpropagation Through Backpropagation Through Will Grathwohl Dami Choi Yuhuai Wu Geoff Roeder David Duvenaud Where do we see this guy? L( ) =E p(b ) [f(b)] Just about everywhere!
More informationProbabilistic Graphical Models for Image Analysis - Lecture 4
Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationBACKPROPAGATION THROUGH THE VOID
BACKPROPAGATION THROUGH THE VOID Optimizing Control Variates for Black-Box Gradient Estimation 27 Nov 2017, University of Cambridge Speaker: Geoffrey Roeder, University of Toronto OPTIMIZING EXPECTATIONS
More informationPattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM
Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationVariational Inference in TensorFlow. Danijar Hafner Stanford CS University College London, Google Brain
Variational Inference in TensorFlow Danijar Hafner Stanford CS 20 2018-02-16 University College London, Google Brain Outline Variational Inference Tensorflow Distributions VAE in TensorFlow Variational
More informationMaster 2 Informatique Probabilistic Learning and Data Analysis
Master 2 Informatique Probabilistic Learning and Data Analysis Faicel Chamroukhi Maître de Conférences USTV, LSIS UMR CNRS 7296 email: chamroukhi@univ-tln.fr web: chamroukhi.univ-tln.fr 2013/2014 Faicel
More informationVariational Inference via Stochastic Backpropagation
Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms
More informationChapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang
Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationDeep Sequence Models. Context Representation, Regularization, and Application to Language. Adji Bousso Dieng
Deep Sequence Models Context Representation, Regularization, and Application to Language Adji Bousso Dieng All Data Are Born Sequential Time underlies many interesting human behaviors. Elman, 1990. Why
More informationUnsupervised Learning
CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality
More informationExpectation Maximization
Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationVariational Bayes on Monte Carlo Steroids
Variational Bayes on Monte Carlo Steroids Aditya Grover, Stefano Ermon Department of Computer Science Stanford University {adityag,ermon}@cs.stanford.edu Abstract Variational approaches are often used
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationBayesian Deep Learning
Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference
More informationVariational Inference for Monte Carlo Objectives
Variational Inference for Monte Carlo Objectives Andriy Mnih, Danilo Rezende June 21, 2016 1 / 18 Introduction Variational training of directed generative models has been widely adopted. Results depend
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationLatent Dirichlet Allocation Introduction/Overview
Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationStatistical Methods for NLP
Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition
More informationConditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013
Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative Chain CRF General
More informationRecurrent Latent Variable Networks for Session-Based Recommendation
Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationLearning Tetris. 1 Tetris. February 3, 2009
Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are
More informationPILCO: A Model-Based and Data-Efficient Approach to Policy Search
PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationGANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution
GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution Matt Kusner, Jose Miguel Hernandez-Lobato Satya Krishna Gorti University of Toronto Satya Krishna Gorti CSC2547 1 / 16 Introduction
More informationA Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationLecture 21: Spectral Learning for Graphical Models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation
More informationAdvances in Variational Inference
1 Advances in Variational Inference Cheng Zhang Judith Bütepage Hedvig Kjellström Stephan Mandt arxiv:1711.05597v1 [cs.lg] 15 Nov 2017 Abstract Many modern unsupervised or semi-supervised machine learning
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationProbabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm
Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to
More informationEnergy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13
Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent
More informationMathematical Formulation of Our Example
Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot
More informationCS6220 Data Mining Techniques Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm
CS6220 Data Mining Techniques Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm Jan-Willem van de Meent, 19 November 2016 1 Hidden Markov Models A hidden Markov model (HMM)
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1
More informationStochastic Variational Inference
Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationLearning Energy-Based Models of High-Dimensional Data
Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationName: Student number:
UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2018 EXAMINATIONS CSC321H1S Duration 3 hours No Aids Allowed Name: Student number: This is a closed-book test. It is marked out of 35 marks. Please
More informationAuto-Encoding Variational Bayes. Stochastic Backpropagation and Approximate Inference in Deep Generative Models
Auto-Encoding Variational Bayes Diederik Kingma and Max Welling Stochastic Backpropagation and Approximate Inference in Deep Generative Models Danilo J. Rezende, Shakir Mohamed, Daan Wierstra Neural Variational
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationNatural Gradients via the Variational Predictive Distribution
Natural Gradients via the Variational Predictive Distribution Da Tang Columbia University datang@cs.columbia.edu Rajesh Ranganath New York University rajeshr@cims.nyu.edu Abstract Variational inference
More information