Deep latent variable models


Deep latent variable models Pierre-Alexandre Mattei, IT University of Copenhagen http://pamattei.github.io @pamattei 19 April 2018, Séminaire de statistique du CNAM 1

Overview of talk: A short introduction to deep learning; Deep latent variable models; On the boundedness of the likelihood of deep latent variable models; Handling missing data in deep latent variable models. 2

But actually, what is deep learning? Deep learning is a general framework for function approximation. It uses parametric approximators called neural networks, which are compositions of tunable affine functions f_1, ..., f_L with a simple fixed nonlinear function σ: F(x) = f_1 ∘ σ ∘ f_2 ∘ ... ∘ σ ∘ f_L(x). These functions are called layers. The nonlinearity σ is usually called the activation function. The derivatives of F with respect to the tunable parameters can be computed using the chain rule via the backpropagation algorithm. 3

A glimpse at the zoology of layers The simplest kind of affine layer is called a fully connected layer: f_l(x) = W_l x + b_l, where W_l and b_l are tunable parameters. The activation function σ is usually a univariate fixed function applied elementwise. Here are two popular choices: [plot of σ(x) against x for the hyperbolic tangent and the rectified linear unit (ReLU)]. 4
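As a concrete illustration (not part of the talk), here is a minimal PyTorch sketch of a network made of two fully connected layers with a tanh activation in between; the layer sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# A minimal network F(x) = f_1(sigma(f_2(x))): two tunable affine (fully
# connected) layers with a fixed elementwise nonlinearity in between.
# Sizes (10 inputs, 20 hidden units, 1 output) are arbitrary for illustration.
net = nn.Sequential(
    nn.Linear(10, 20),   # affine layer: W x + b
    nn.Tanh(),           # activation sigma, applied elementwise
    nn.Linear(20, 1),    # affine layer: W h + b
)

x = torch.randn(5, 10)   # a batch of 5 inputs in R^10
y = net(x)               # forward pass; gradients come from backpropagation (autograd)
```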

Why is it convenient to compose affine functions? Neural nets are powerful approximators: any continuous function can be arbitrarily well approximated on a compact set using a three-layer fully connected network F = f_1 ∘ σ ∘ f_2 (universal approximation theorem, Cybenko, 1989; Hornik, 1991). Some prior knowledge can be distilled into the architecture (i.e. the type of affine functions/activations) of the network. For example, convolutional neural networks (convnets, LeCun, 1989) leverage the fact that local information plays an important role in images/sound/sequence data. In that case, the affine functions are convolution operators with some learnt filters. When the neural network parametrises a regression function, empirical evidence shows that adding more layers leads to better out-of-sample behaviour. Roughly, this means that adding more layers is a way of increasing the complexity of statistical models without paying a large overfitting price: there is a regularisation-by-depth effect. 5

A simple example: nonlinear regression with a multilayer perceptron (MLP) We want to perform regression on a data set (x_1, y_1), ..., (x_n, y_n) ∈ ℝ^p × ℝ. We can model the regression function using a multilayer perceptron (MLP): two fully connected layers with a hyperbolic tangent in between: ∀ i ≤ n, y_i = F(x_i) + ε_i = W_1 tanh(W_0 x_i + b_0) + b_1 + ε_i. The coordinates of the intermediate representation W_0 x_i + b_0 are called hidden units. If we assume that the noise is Gaussian, then we can find the maximum likelihood estimates of W_1, W_0, b_1, b_0 by minimising the squared error using gradient descent. Gradients are computed via backpropagation. 6

A simple example: nonlinear regression with a multilayer perceptron (MLP) ∀ i ≤ n, y_i = F(x_i) + ε_i = W_1 tanh(W_0 x_i + b_0) + b_1 + ε_i. Let's try to recover the function sin(x)/x using 20 samples: [plots of the fitted regression function with 5 hidden units and with 20 hidden units]. 7
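A minimal sketch of this kind of experiment (my illustration, not the talk's code), using the squared-error loss as the Gaussian maximum-likelihood objective; the noise level, hidden width, learning rate, and the use of the Adam optimiser as a stand-in for plain gradient descent are all arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 20 noisy samples of y = sin(x)/x (illustrative noise level)
x = torch.linspace(-10, 10, 20).unsqueeze(1)
y = torch.sin(x) / x + 0.05 * torch.randn_like(x)

# MLP with one hidden tanh layer: F(x) = W_1 tanh(W_0 x + b_0) + b_1
mlp = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1))

# Gaussian noise assumption => maximum likelihood = least squares
optimiser = torch.optim.Adam(mlp.parameters(), lr=1e-2)
for step in range(2000):
    optimiser.zero_grad()
    loss = ((mlp(x) - y) ** 2).mean()  # squared error
    loss.backward()                    # gradients via backpropagation
    optimiser.step()
```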

Overview of talk: A short introduction to deep learning; Deep latent variable models; On the boundedness of the likelihood of deep latent variable models; Handling missing data in deep latent variable models. 8

Continuous latent variable models A generative model p(x) describes a process that is assumed to give rise to some data (D. MacKay). In a continuous latent variable model we assume that there is an unobserved random variable z ∈ ℝ^d. Usually, d is smaller than the dimensionality of the data, and we can think of z as a code summarizing multivariate data x. A classic example: factor analysis. The generative process is: z ~ N(0, I_d), x | z ~ N(Wz + µ, Ψ). 9
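For concreteness, a small NumPy sketch (my addition) of sampling from this factor analysis model, with arbitrary dimensions and a diagonal Ψ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 2, 5, 1000          # latent dim, data dim, sample size (arbitrary)

W = rng.normal(size=(p, d))                    # factor loadings
mu = rng.normal(size=p)                        # mean
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))   # diagonal noise covariance

z = rng.normal(size=(n, d))                               # z ~ N(0, I_d)
noise = rng.multivariate_normal(np.zeros(p), Psi, size=n)
x = z @ W.T + mu + noise                                  # x | z ~ N(Wz + mu, Psi)
```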

Deep latent variable models (DLVMs) Deep latent variable models combine the approximation abilities of deep neural networks and the statistical foundations of generative models. They were independently invented by Kingma and Welling (2014) as variational autoencoders, and by Rezende et al. (2014) as deep latent Gaussian models. 10

Deep latent variable models (DLVMs) (Kingma and Welling, 2014; Rezende et al., 2014; Mattei and Frellsen, 2018) Assume that (x_i, z_i)_{i ≤ n} are i.i.d. random variables driven by the model: z ~ p(z) (prior), x | z ~ p_θ(x | z) = Φ(x | f_θ(z)) (output density), where z ∈ ℝ^d is the latent variable, x ∈ X is the observed variable, the function f_θ: ℝ^d → H is a (deep) neural network called the decoder, and (Φ(· | η))_{η ∈ H} is a parametric family of output densities, e.g. multivariate Gaussians or products of multinomials. [Graphical model: z → x with parameter θ, plate over the n observations.] 11

The role of the decoder The role of the decoder f_θ: ℝ^d → H is to transform z (the code) into parameters η = f_θ(z) of the output density Φ(· | η). The weights θ of the decoder are learned. An illustrative example of a simple non-linear decoder is z ~ N(0, I_2) and f(z) = z/10 + z/‖z‖. Image from Doersch (2016). 12
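A tiny NumPy sketch (my addition) of this illustrative decoder, which maps a 2-D Gaussian code onto a ring:

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=(1000, 2))                                   # z ~ N(0, I_2)
f_z = z / 10 + z / np.linalg.norm(z, axis=1, keepdims=True)      # f(z) = z/10 + z/||z||
# f_z concentrates around the unit circle: a simple nonlinear decoder turning
# an isotropic Gaussian code into ring-shaped output parameters.
```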

DLVMs applications: density estimation on MNIST (Rezende et al., 2014) Training data Model samples 13

DLVMs applications: density estimation on (Brendan) Frey faces (Rezende et al., 2014) Training data Model samples 14

DLVMs applications: Data imputation (Rezende et al., 2014) 15

Learning DLVMs (Kingma and Welling, 2014; Rezende et al., 2014; Mattei and Frellsen, 2018) Given a data matrix X = (x_1, ..., x_n) ∈ X^n, the log-likelihood function for a DLVM is l(θ) = log p_θ(X) = Σ_{i=1}^n log p_θ(x_i), where p_θ(x_i) = ∫_{ℝ^d} p_θ(x_i | z) p(z) dz. We would like to find the MLE θ̂ = argmax_θ l(θ). However, even with a simple output density p_θ(x | z): p_θ(x) is intractable, rendering MLE intractable; p_θ(z | x) is intractable, rendering EM intractable; stochastic EM is not scalable to large n and moderate d. 16
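The integral p_θ(x) = ∫ p_θ(x | z) p(z) dz can in principle be approximated by plain Monte Carlo; here is a naive sketch (my illustration, not part of the talk), assuming a standard Gaussian prior and, for concreteness, a Bernoulli output density with x binarised to {0, 1}.

```python
import math
import torch

def log_p_hat(x, decoder, n_samples=1000, d=2):
    # Naive Monte Carlo estimate of log p_theta(x) = log E_{z ~ N(0, I_d)}[p_theta(x | z)].
    # `decoder` stands for the network f_theta and maps codes z to Bernoulli probabilities.
    z = torch.randn(n_samples, d)                    # z ~ p(z)
    probs = decoder(z)                               # eta = f_theta(z)
    log_px_z = torch.distributions.Bernoulli(probs=probs).log_prob(x).sum(-1)
    # log (1/K) sum_k p_theta(x | z_k), computed stably in log space
    return torch.logsumexp(log_px_z, dim=0) - math.log(n_samples)
```

In even moderate latent dimensions this estimator has very high variance, which is one motivation for the variational approach that follows.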

Variational Inference (VI) (Kingma and Welling, 2014; Rezende et al., 2014; Blei et al., 2017) VI approximately maximises the log-likelihood by maximising the evidence lower bound ELBO(θ, q) = E_{Z~q}[log(p_θ(X, Z)/q(Z))] = l(θ) − KL(q ‖ p_θ(· | X)) ≤ l(θ) w.r.t. (θ, q), where q ∈ D is a variational distribution from a family of distributions D over the space of codes ℝ^{n×d}, X = (x_1, ..., x_n) ∈ X^n and Z = (z_1, ..., z_n) ∈ ℝ^{n×d}. However, computing a distribution over ℝ^{n×d} is too costly for large datasets. 17

Amortised Variational Inference (AVI) (Kingma and Welling, 2014; Rezende et al., 2014; Gershman and Goodman, 2014) Amortised inference scales up VI by learning a function g that transforms each data point into the parameters of the approximate posterior: q_{γ,X}(Z) = Π_{i=1}^n Ψ(z_i | g_γ(x_i)), where (Ψ(· | κ))_{κ ∈ K} is a parametric family of distributions over ℝ^d (usually Gaussians) and g_γ: X → K is a neural net called the inference network. Inference for DLVMs solves the optimisation problem max_{θ ∈ Θ, γ ∈ Γ} ELBO(θ, q_{γ,X}). A DLVM with AVI is called a variational autoencoder (VAE). [Graphical model: γ → z → x with parameter θ, plate over the n observations.] 18
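A compact PyTorch sketch (my own hedged illustration) of one ELBO evaluation for a VAE with a Gaussian inference network, a standard Gaussian prior, and a Bernoulli output density; the reparameterisation z = µ + σ·ε lets gradients flow through the sample, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

d, p = 2, 784   # latent and data dimensions (arbitrary; p = 784 fits MNIST-like data)

decoder = nn.Sequential(nn.Linear(d, 200), nn.Tanh(), nn.Linear(200, p), nn.Sigmoid())
encoder = nn.Sequential(nn.Linear(p, 200), nn.Tanh(), nn.Linear(200, 2 * d))  # outputs (mu, log var)

def elbo(x):
    # x: batch of data points binarised to {0, 1}
    mu, logvar = encoder(x).chunk(2, dim=-1)          # parameters of Psi(z | g_gamma(x))
    eps = torch.randn_like(mu)
    z = mu + (0.5 * logvar).exp() * eps               # reparameterised sample z ~ q
    probs = decoder(z)                                # eta = f_theta(z)
    log_px_z = torch.distributions.Bernoulli(probs=probs).log_prob(x).sum(-1)
    # KL(N(mu, sigma^2 I) || N(0, I)) in closed form
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
    return (log_px_z - kl).mean()                     # maximise w.r.t. theta and gamma
```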

Our contributions (Mattei and Frellsen, 2018) On the boundedness of the likelihood of deep latent variable models: We show that maximum likelihood is ill-posed for DLVMs with Gaussian outputs. We propose how to tackle this problem using constraints. We show that maximum likelihood is well-posed for DLVMs with discrete outputs. We provide a way of finding an upper bound of the likelihood. Handling missing data in deep latent variable models: For missing data at test time, we show how to draw samples according to the exact conditional distribution of the missing data. 19

Overview of talk: A short introduction to deep learning; Deep latent variable models; On the boundedness of the likelihood of deep latent variable models; Handling missing data in deep latent variable models. 20

On the boundedness of the likelihood of DLVMs (Mattei and Frellsen, 2018) If we see the prior as a mixing distribution, DLVMs are continuous mixtures of the output distribution. But ML for finite Gaussian mixtures is ill-posed: the likelihood function is unbounded, and the parameters with infinite likelihood are pretty terrible. “Mixtures, like tequila, are inherently evil and should be avoided at all costs” (Larry Wasserman). Hence the questions: Is the likelihood function of DLVMs with Gaussian outputs bounded above? Do we really want a very tight ELBO? 21

On the boundedness of the likelihood of DLVMs (Mattei and Frellsen, 2018) Consider a DLVM with p-variate Gaussian outputs, where l(θ) = Σ_{i=1}^n log ∫_{ℝ^d} N(x_i | µ_θ(z), Σ_θ(z)) p(z) dz. Like Kingma and Welling (2014), consider an MLP decoder with h ∈ ℕ hidden units of the form µ_θ(z) = V tanh(Wz + a) + b and Σ_θ(z) = exp(αᵀ tanh(Wz + a) + β) I_p, where θ = (W, a, V, b, α, β). Now consider a subfamily with h = 1 and θ_k^{(i,w)} = (α_k wᵀ, 0, 0, x_i, α_k, −α_k), where (α_k)_{k≥1} is a nonnegative real sequence that tends to infinity. This gives µ_{θ_k^{(i,w)}}(z) = x_i and Σ_{θ_k^{(i,w)}}(z) = exp(α_k tanh(α_k wᵀz) − α_k) I_p. Image from http://deeplearning.net/tutorial/mlp.html 22

On the boundedness of the likelihood of DLVMs (Mattei and Frellsen, 2018) Theorem. For all i ∈ {1, ..., n} and w ∈ ℝ^d \ {0}, we have lim_{k→∞} l(θ_k^{(i,w)}) = +∞. Proof main idea: the contribution log p_{θ_k^{(i,w)}}(x_i) of the i-th observation explodes while all other contributions remain bounded below. Do these infinite suprema lead to useful generative models? Proposition. For all k ∈ ℕ, i ∈ {1, ..., n} and w ∈ ℝ^d \ {0}, the distribution p_{θ_k^{(i,w)}} is spherically symmetric and unimodal around x_i. No, because of the constant mean function. What about other parametrisations? The MLP used here is a subfamily, and the universal approximation abilities of neural networks suggest the problem is not specific to it. 23
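To see the unboundedness numerically, here is a small Monte Carlo sketch (my illustration, using the subfamily above): the estimated log p_θ(x_i) grows without bound as α_k increases, because the variance collapses on the half of the latent space where wᵀz < 0 while the mean stays fixed at x_i.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
d, p, K = 2, 5, 100_000          # latent dim, data dim, Monte Carlo samples (arbitrary)
w = np.ones(d)                   # any nonzero direction w

z = rng.normal(size=(K, d))      # z ~ N(0, I_d)
for alpha in [1.0, 5.0, 20.0, 80.0]:
    # Subfamily decoder: mu(z) = x_i (constant), Sigma(z) = sigma2(z) * I_p
    log_sigma2 = alpha * np.tanh(alpha * z @ w) - alpha
    # log N(x_i | x_i, sigma2 I_p) = -(p/2) * (log(2 pi) + log sigma2)
    log_density_at_xi = -0.5 * p * (np.log(2 * np.pi) + log_sigma2)
    log_p_xi = logsumexp(log_density_at_xi) - np.log(K)   # MC estimate of log p(x_i)
    print(f"alpha = {alpha:5.1f}   estimated log p(x_i) = {log_p_xi:10.2f}")
```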

Tackling the unboundedness of the likelihood (Mattei and Frellsen, 2018) Proposition. Let ξ > 0. If the parametrisation of the decoder is such that the image of Σ_θ is included in S_p^ξ = {A ∈ S_p^+ : min(Sp A) ≥ ξ} for all θ (where Sp A denotes the spectrum of A), then the log-likelihood function is upper bounded by −(np/2) log(2πξ). Note: such constraints can be implemented by adding ξ I_p to Σ_θ(z). 24
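A minimal sketch (my illustration) of how such a constraint can be wired into a Gaussian decoder head, assuming a diagonal covariance and an arbitrary ξ:

```python
import torch
import torch.nn as nn

d, p, xi = 2, 5, 1e-3   # latent dim, data dim, variance lower bound (all arbitrary)

hidden = nn.Sequential(nn.Linear(d, 50), nn.Tanh())
mean_head = nn.Linear(50, p)
logvar_head = nn.Linear(50, p)

def decode(z):
    h = hidden(z)
    mu = mean_head(h)
    # Adding xi to every diagonal entry keeps all eigenvalues of Sigma >= xi,
    # so the log-likelihood stays bounded above.
    var = logvar_head(h).exp() + xi
    return mu, var
```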

Discrete DLVMs do not suffer from unbounded likelihood When X = {0, 1}^p, Bernoulli DLVMs assume that (Φ(· | η))_{η ∈ H} is the family of p-variate multivariate Bernoulli distributions (that is, the family of products of p univariate Bernoulli distributions). In this case, maximum likelihood is well-posed. Proposition. Given any possible parametrisation, the log-likelihood function of a deep latent variable model with Bernoulli outputs is everywhere negative. 25

Towards data-dependent likelihood upper bounds (Mattei and Frellsen, 2018) We can interpret a DLVM as a parsimonious submodel of a nonparametric mixture model p_G(x) = ∫_H Φ(x | η) dG(η), with l(G) = Σ_{i=1}^n log p_G(x_i). The model parameter is the mixing distribution G ∈ P, where P is the set of all probability measures over the parameter space H. This is a DLVM when G is generatively defined by z ~ p(z), η = f_θ(z). This is a finite mixture model when G has finite support. This generalisation bridges the gap between finite mixtures and DLVMs. 26

Towards data-dependent likelihood upper bounds (Mattei and Frellsen, 2018; Cremer et al., 2018) This gives us an immediate upper bound on the likelihood for any decoder f_θ: l(θ) ≤ max_{G ∈ P} l(G). Theorem. Assume that (Φ(· | η))_{η ∈ H} is the family of multivariate Bernoulli distributions, or of Gaussian distributions with the spectral constraint. The likelihood of the nonparametric mixture model is maximised by a finite mixture of k ≤ n distributions from the family (Φ(· | η))_{η ∈ H}. 27

Unboundedness for a DLVM with Gaussian outputs (Frey faces) 28

Overview of talk: A short introduction to deep learning; Deep latent variable models; On the boundedness of the likelihood of deep latent variable models; Handling missing data in deep latent variable models. 29

Data imputation with variational autoencoders (Rezende et al., 2014; Mattei and Frellsen, 2018) After training an encoder/decoder pair, we consider a new data point x = (x^obs, x^miss). In principle we can impute x^miss using p_θ(x^miss | x^obs) = ∫_{ℝ^d} Φ(x^miss | x^obs, f_θ(z)) p(z | x^obs) dz. Since this integral is intractable, Rezende et al. (2014) suggested pseudo-Gibbs sampling, forming a Markov chain (z_t, x̂_t^miss)_{t≥1} with z_t ~ Ψ(z | g_γ(x^obs, x̂_{t-1}^miss)) and x̂_t^miss ~ Φ(x^miss | x^obs, f_θ(z_t)). We propose Metropolis-within-Gibbs instead. 30
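A schematic sketch of the pseudo-Gibbs chain (my illustration; `encoder` and `decoder` stand for the trained inference network g_γ and decoder f_θ, here with a Gaussian inference network and Bernoulli outputs):

```python
import torch

def pseudo_gibbs(x_obs, miss_mask, encoder, decoder, n_iter=300):
    # Impute the missing entries of a single data point by pseudo-Gibbs sampling.
    # x_obs: data vector (arbitrary values at the missing positions);
    # miss_mask: boolean tensor, True where the entry is missing.
    x = x_obs.clone()
    x[miss_mask] = 0.5                                   # arbitrary initialisation
    for _ in range(n_iter):
        mu, logvar = encoder(x).chunk(2, dim=-1)         # Psi(z | g_gamma(x_obs, x_miss))
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        probs = decoder(z)                               # Phi(x | x_obs, f_theta(z)), Bernoulli case
        x_new = torch.distributions.Bernoulli(probs=probs).sample()
        x = torch.where(miss_mask, x_new, x_obs)         # keep the observed entries fixed
    return x
```

The Metropolis-within-Gibbs variant proposed in the talk replaces the plain update of z_t by an accept/reject step that targets the exact posterior p_θ(z | x^obs, x̂^miss), using the inference network only as a proposal; see Mattei and Frellsen (2018) for the exact acceptance ratio.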

Comparing pseudo-Gibbs and Metropolis-within-Gibbs (Mattei and Frellsen, 2018) Network architectures from Rezende et al. (2014) with 200 hidden units and an intrinsic dimension of 50. Both samplers use the same trained VAE and perform the same number of iterations (300). 31

Comparing pseudo-Gibbs and Metropolis-within-Gibbs (Mattei and Frellsen, 2018) Network architectures from Rezende et al. (2014) with 200 hidden units and an intrinsic dimension of 50. Both samplers use the same trained VAE and perform the same number of iterations (500). 32

Summary DLVMs are highly flexible generative models. We showed that MLE is ill-posed for unconstrained DLVMs with Gaussian outputs. We proposed how to tackle this problem using constraints. We provided an upper bound for the likelihood in well-posed cases. We showed how to draw samples according to the exact conditional distribution with missing data. Thank you for your attention! Questions? 33

References Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational Inference: A Review for Statisticians. In: Journal of the American Statistical Association 112.518, pp. 859–877. Cremer, C., X. Li, and D. Duvenaud (2018). Inference Suboptimality in Variational Autoencoders. In: arXiv:1801.03558. Day, N. E. (1969). Estimating the Components of a Mixture of Normal Distributions. In: Biometrika 56.3, pp. 463–474. Doersch, C. (2016). Tutorial on Variational Autoencoders. In: arXiv:1606.05908. Frellsen, J., I. Moltke, M. Thiim, K. V. Mardia, J. Ferkinghoff-Borg, and T. Hamelryck (2009). A Probabilistic Model of RNA Conformational Space. In: PLOS Computational Biology 5.6, pp. 1–11. Gershman, S. and N. Goodman (2014). Amortized Inference in Probabilistic Reasoning. In: Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 36. Kingma, D. P. and M. Welling (2014). Auto-Encoding Variational Bayes. In: International Conference on Learning Representations. MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. Mattei, P.-A. and J. Frellsen (2018). Leveraging the Exact Likelihood of Deep Latent Variable Models. In: arXiv:1802.04826. 34

References Mohamed, S. and D. Rezende (2017). Tutorial on Deep Generative Models. UAI 2017. Rezende, D. J., S. Mohamed, and D. Wierstra (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286. 35