REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS


Chris Cremer, Quaid Morris & David Duvenaud
Department of Computer Science, University of Toronto
{ccremer,duvenaud}@cs.toronto.edu, {quaid.morris}@utoronto.ca

Workshop track - ICLR 2017

ABSTRACT

The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, and visualize the implicit importance-weighted approximate posterior.

1 BACKGROUND

The importance-weighted autoencoder (IWAE; Burda et al. (2016)) maximizes the following multisample evidence lower bound (ELBO):

$$\log(p(x)) \geq \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\log\left(\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad \text{(IWAE ELBO)}$$

which is a tighter lower bound than the ELBO maximized by the variational autoencoder (VAE; Kingma & Welling (2014)):

$$\log(p(x)) \geq \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad \text{(VAE ELBO)}$$

Here we've written the VAE bound as a multisample lower bound to compare it to the IWAE bound. The following equations are the gradients of the VAE ELBO and the IWAE ELBO, respectively:

$$\nabla_{\Theta}\,\mathcal{L}_{VAE}(x) = \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\nabla_{\Theta}\log\left(\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad (1)$$

$$\nabla_{\Theta}\,\mathcal{L}_{IWAE}(x) = \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\sum_{i=1}^{k}\tilde{w}_i\,\nabla_{\Theta}\log\left(\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad (2)$$

where

$$\tilde{w}_i = \frac{p(x,z_i)/q(z_i|x)}{\sum_{j=1}^{k} p(x,z_j)/q(z_j|x)}.$$

Figure 1: Approximations to a complex true distribution, defined via sampling-importance-resampling. Panels show the true posterior and q_IW with k = 1, k = 10, and k = 100. As k grows, this approximation approaches the true distribution.
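As a concrete illustration, the following minimal numpy sketch estimates both multisample bounds for a toy one-dimensional model. The two-component Gaussian mixture standing in for p(x, z) and the broad Gaussian proposal q are illustrative assumptions, not the models used in our experiments. The toy joint is normalized so that log p(x) = 0; the IWAE ELBO should therefore climb toward 0 as k grows, while the multisample VAE ELBO stays flat in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gaussian(z, mean, std):
    # Elementwise log N(z; mean, std^2).
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(z):
    # Toy log p(x, z) for a fixed x: an equal-weight two-component Gaussian
    # mixture in z, normalized so that log p(x) = 0.
    return np.logaddexp(log_gaussian(z, -2.0, 0.5), log_gaussian(z, 1.5, 0.8)) - np.log(2.0)

def log_q(z):
    # Toy encoder q(z|x): a single broad Gaussian.
    return log_gaussian(z, 0.0, 2.0)

def multisample_elbos(k, n_batches=20000):
    # z has shape (n_batches, k): k samples per Monte Carlo estimate.
    z = rng.normal(0.0, 2.0, size=(n_batches, k))
    log_w = log_joint(z) - log_q(z)                    # log p(x,z_i)/q(z_i|x)
    vae = log_w.mean()                                 # VAE ELBO: average of logs
    iwae = (np.logaddexp.reduce(log_w, axis=1) - np.log(k)).mean()  # IWAE ELBO: log of average
    return vae, iwae

for k in [1, 5, 50]:
    vae, iwae = multisample_elbos(k)
    print(f"k={k:2d}   VAE ELBO ~ {vae:+.3f}   IWAE ELBO ~ {iwae:+.3f}")
```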

From Equations 1 and 2, we see that the gradient of the VAE ELBO weights the samples evenly, whereas the IWAE gradient weights the samples according to their relative importance $\tilde{w}_i$.

2 DEFINING THE IMPLICIT DISTRIBUTION q_IW

In this section, we derive the implicit distribution that arises from importance sampling from a distribution p using q as a proposal distribution. Given a batch of samples z_1 ... z_k from q(z|x), the importance-weighted distribution q_IW, as a function of one of the samples z_i, is:

$$q_{IW}(z_i \mid x, z_{\setminus i}) = k\,\tilde{w}_i\,q(z_i \mid x) = \frac{\frac{p(x,z_i)}{q(z_i|x)}\,q(z_i \mid x)}{\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}} = \frac{p(x,z_i)}{\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}} \qquad (3)$$

The marginal distribution q_IW(z|x) is given by:

$$q_{IW}(z_i \mid x) = \mathbb{E}_{z_{\setminus i} \sim q(z|x)}\left[q_{IW}(z_i \mid x, z_{\setminus i})\right] \quad \text{for any } i. \qquad (4)$$

When k = 1, q_IW = q(z|x). As soon as k > 1, the form of q_IW depends on the true posterior p. When k = ∞, q_IW(z|x) becomes the true posterior p(z|x). See the Appendix for details.

Figure 1 visualizes q_IW on a 2D distribution-approximation problem. The base distribution q is a Gaussian. As we increase the number of samples k used for the sampling-resampling, the approximation approaches the true distribution. This distribution is nonparametric in the sense that, as the true posterior grows more complex, so does the shape of q_IW.

2.1 RECOVERING THE IWAE BOUND FROM THE VAE BOUND

Here we show that the IWAE ELBO is equivalent to the VAE ELBO, but with a more flexible q(z|x) distribution, implicitly defined by importance reweighting. First, we write the VAE ELBO in its multisample form, as an average over k samples:

$$\log p(x) \geq \mathcal{L}_{VAE}[q] = \mathbb{E}_{z \sim q(z|x)}\left[\log\left(\frac{p(x,z)}{q(z|x)}\right)\right] = \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad (5)$$

If we now set q(z) = q_IW(z), then we recover the IWAE ELBO:

$$\mathcal{L}_{VAE}[q_{IW}] = \mathbb{E}_{z_1 \dots z_k \sim q_{IW}(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{q_{IW}(z_i \mid x, z_{\setminus i})}\right)\right] \qquad (6)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\sum_{l=1}^{k}\tilde{w}_l\,\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] \qquad (7)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] = \mathcal{L}_{IWAE} \qquad (8)$$

Thus we see that the VAE ELBO with q_IW is equivalent to the IWAE ELBO. For a more detailed derivation, see the Appendix.
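Equations (3) and (4) can be probed directly by Monte Carlo: fix a point z, draw the remaining k-1 samples from q, and average the conditional density of Eq. (3). The sketch below does this on a grid for the same kind of toy one-dimensional target and Gaussian proposal as in the earlier sketch (illustrative assumptions, not the 2D model of Figure 1); as k grows, the estimated q_IW density approaches the true posterior, mirroring Figure 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_gaussian(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(z):
    # Toy log p(x, z), normalized so the true posterior p(z|x) is exp(log_joint(z)).
    return np.logaddexp(log_gaussian(z, -2.0, 0.5), log_gaussian(z, 1.5, 0.8)) - np.log(2.0)

def log_q(z):
    return log_gaussian(z, 0.0, 2.0)

def q_iw_density(z_grid, k, n_mc=5000):
    """Monte Carlo estimate of Eq. (4): average the conditional density of Eq. (3)
    over the other k-1 samples drawn from q."""
    dens = np.empty_like(z_grid)
    for n, z in enumerate(z_grid):
        z_rest = rng.normal(0.0, 2.0, size=(n_mc, k - 1))    # the other k-1 samples from q
        log_w_rest = log_joint(z_rest) - log_q(z_rest)        # their importance ratios
        log_w_z = np.full((n_mc, 1), log_joint(z) - log_q(z))
        log_avg_w = np.logaddexp.reduce(
            np.concatenate([log_w_z, log_w_rest], axis=1), axis=1) - np.log(k)
        dens[n] = np.exp(log_joint(z) - log_avg_w).mean()     # Eq. (3), averaged over z_{\i}
    return dens

z_grid = np.linspace(-4.0, 4.0, 9)
for k in [1, 10, 100]:
    print(f"k={k:3d} ", np.round(q_iw_density(z_grid, k), 3))
print("true  ", np.round(np.exp(log_joint(z_grid)), 3))
```

With k = 1 the estimate reproduces q itself, consistent with q_IW = q(z|x) in that case.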

3 SAMPLING q_IW

The procedure to sample from q_IW(z|x) is shown in Algorithm 1. It is equivalent to sampling-importance-resampling (SIR).

Algorithm 1 Sampling from q_IW
1: k ← number of samples
2: q(z|x) = f_φ(x)
3: for i in 1 ... k do
4:     z_i ~ q(z|x)
5:     w_i = p(x, z_i) / q(z_i|x)
6: w̃_i = w_i / Σ_{j=1}^{k} w_j for each i
7: j ~ Cat(w̃)
8: Return z_j

Figure 2: Algorithm 1 defines the procedure to sample from q_IW.

4 RESAMPLING FOR PREDICTION

During training, we sample from the q distribution and implicitly weight the samples with the IWAE ELBO. After training, we need to explicitly reweight samples from q.

Figure 3: Reconstructions of MNIST samples from q(z|x) and q_IW. The model was trained by maximizing the IWAE ELBO with K = 50 and 2 latent dimensions. The reconstructions from q(z|x) are greatly improved by the sampling-resampling step of q_IW.

In Fig. 3, we demonstrate the need to sample from q_IW rather than q(z|x) when reconstructing MNIST digits. We trained the model to maximize the IWAE ELBO with K = 50 and 2 latent dimensions, similar to Appendix C in Burda et al. (2016). When we sample from q(z|x) and reconstruct the samples, we see a number of anomalies. However, if we perform the sampling-resampling step (Algorithm 1), the reconstructions are much more accurate. The intuition here is that we trained the model with q_IW with K = 50 but then sampled from q(z|x) (that is, q_IW with K = 1), which are very different distributions, as seen in Fig. 1.
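Algorithm 1 is ordinary sampling-importance-resampling and fits in a few lines of numpy. The sketch below uses a fixed Gaussian as a stand-in for the encoder q(z|x) and a toy mixture as a stand-in for p(x, z); these are illustrative assumptions rather than the trained MNIST model, but they show how resampling moves draws from q toward the posterior, which is the effect behind the improved reconstructions in Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_gaussian(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(z):
    # Toy log p(x, z): stand-in for prior times decoder likelihood at a fixed x.
    return np.logaddexp(log_gaussian(z, -2.0, 0.5), log_gaussian(z, 1.5, 0.8)) - np.log(2.0)

def log_q(z):
    # Stand-in for the encoder q(z|x).
    return log_gaussian(z, 0.0, 2.0)

def sample_q_iw(k):
    """One draw from q_IW(z|x) via sampling-importance-resampling (Algorithm 1)."""
    z = rng.normal(0.0, 2.0, size=k)            # step 4: z_i ~ q(z|x)
    log_w = log_joint(z) - log_q(z)             # step 5: w_i = p(x,z_i)/q(z_i|x)
    w_tilde = np.exp(log_w - np.logaddexp.reduce(log_w))  # step 6: normalize (in log space)
    j = rng.choice(k, p=w_tilde)                # step 7: j ~ Cat(w_tilde)
    return z[j]                                 # step 8: return z_j

# With k = 1 this is an ordinary draw from q; with larger k the resampled draws
# concentrate on the posterior modes.
for k in [1, 50]:
    draws = np.array([sample_q_iw(k) for _ in range(5000)])
    print(f"k={k:2d}   mean {draws.mean():+.2f}   std {draws.std():.2f}")
```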

5 DISCUSSION

Bachman & Precup (2015) also showed that the IWAE objective is equivalent to stochastic variational inference with a proposal distribution corrected towards the true posterior via normalized importance sampling. In other words, the IWAE lower bound can be interpreted as the standard VAE lower bound with an implicit q_IW distribution. We build on this idea by further examining q_IW and by providing visualizations to help better grasp the interpretation. In light of this, IWAE can be seen as increasing the complexity of the approximate distribution q, similar to other methods that increase the complexity of q, such as Normalizing Flows (Jimenez Rezende & Mohamed (2015)), Variational Boosting (Miller et al. (2016)), or Hamiltonian variational inference (Salimans et al. (2015)).

REFERENCES

Philip Bachman and Doina Precup. Training Deep Generative Models: Variations on a Theme. NIPS Approximate Inference Workshop, 2015.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In ICLR, 2016.

Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. In ICML, 2015.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.

Andrew C. Miller, Nicholas Foti, and Ryan P. Adams. Variational Boosting: Iteratively Refining Posterior Approximations. Advances in Approximate Bayesian Inference, NIPS Workshop, 2016.

Tim Salimans, Diederik P. Kingma, and Max Welling. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In ICML, 2015.

6 APPENDIX

6.1 DETAILED DERIVATION OF THE EQUIVALENCE OF THE VAE AND IWAE BOUNDS

First, we write the VAE ELBO in its multisample form, as an average over k samples:

$$\log p(x) \geq \mathcal{L}_{VAE} = \mathbb{E}_{z \sim q(z|x)}\left[\log\left(\frac{p(x,z)}{q(z|x)}\right)\right] = \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{q(z_i|x)}\right)\right] \qquad (9)$$

If we now set q(z) = q_IW(z), then we recover the IWAE ELBO:

$$\mathcal{L}_{VAE}[q_{IW}] = \mathbb{E}_{z_1 \dots z_k \sim q_{IW}(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{q_{IW}(z_i \mid x, z_{\setminus i})}\right)\right] \qquad (10)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q_{IW}(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{p(x,z_i)}{p(x,z_i)\Big/\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)}\right)\right] \qquad (11)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q_{IW}(z|x)}\left[\frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] \qquad (12)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q_{IW}(z|x)}\left[\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] \qquad (13)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\sum_{l=1}^{k}\tilde{w}_l\,\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] \qquad (14)$$

$$= \mathbb{E}_{z_1 \dots z_k \sim q(z|x)}\left[\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p(x,z_j)}{q(z_j|x)}\right)\right] = \mathcal{L}_{IWAE} \qquad (15)$$

Eqn. 13 follows from Eqn. 12 since the index i does not appear inside the sum over i. The step from Eqn. 13 to Eqn. 14 uses the fact that sampling from q_IW amounts to drawing z_1 ... z_k from q and resampling one of them with probability w̃_l (Algorithm 1), so an expectation under q_IW can be written as a w̃-weighted expectation under q. Similarly, from Eqn. 14 to Eqn. 15, the index l does not appear inside the sum over l and Σ_l w̃_l sums to one. Eqn. 15 is the IWAE ELBO; thus we see that the VAE ELBO with q_IW is equivalent to the IWAE ELBO.
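The core of the derivation, Eqns. (10) to (13), is a pointwise identity: for any batch z_1 ... z_k, each term log p(x,z_i)/q_IW(z_i | x, z_{\i}) collapses to the single IWAE integrand log((1/k) Σ_j p(x,z_j)/q(z_j|x)). This can be checked mechanically; the numpy sketch below does so with the same illustrative toy densities used in the earlier sketches (an assumed Gaussian q and mixture joint, not a trained model).

```python
import numpy as np

rng = np.random.default_rng(3)

def log_gaussian(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(z):
    return np.logaddexp(log_gaussian(z, -2.0, 0.5), log_gaussian(z, 1.5, 0.8)) - np.log(2.0)

def log_q(z):
    return log_gaussian(z, 0.0, 2.0)

k = 8
z = rng.normal(0.0, 2.0, size=k)                      # one batch z_1 ... z_k ~ q(z|x)
log_w = log_joint(z) - log_q(z)                       # log p(x,z_i)/q(z_i|x)
log_avg_w = np.logaddexp.reduce(log_w) - np.log(k)    # log (1/k) sum_j p(x,z_j)/q(z_j|x)

# Eq. (3): log q_IW(z_i | x, z_{\i}) = log p(x,z_i) - log (1/k) sum_j p(x,z_j)/q(z_j|x)
log_q_iw = log_joint(z) - log_avg_w

# Integrand of the VAE ELBO under q_IW, Eq. (10), evaluated per sample:
per_sample = log_joint(z) - log_q_iw

# Eqns. (11)-(13): every per-sample term equals the single IWAE integrand,
# so their 1/k average is that integrand as well.
print(np.allclose(per_sample, log_avg_w))         # True
print(np.allclose(per_sample.mean(), log_avg_w))  # True
```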

6.2 k = ∞

Recall that the marginal likelihood can be approximated by:

$$p(x) = \mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \approx \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)} \qquad (16)$$

where each z_i is sampled from q(z|x). Thus, when k = ∞, the denominator of Eq. 3 becomes p(x), and:

$$q_{IW}(z \mid x) = \frac{p(x,z)}{\mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right]} = \frac{p(x,z)}{p(x)} = p(z \mid x) \qquad (17)$$

Thus q_IW(z|x) is equal to the true posterior p(z|x) when k = ∞, as expected.

6.3 PROOF THAT q_IW IS CLOSER TO THE TRUE POSTERIOR THAN q

Section 6.1 showed that L_IWAE(q) = L_VAE(q_IW); that is, the IWAE ELBO with the base q is equivalent to the VAE ELBO with the importance-weighted q_IW. Due to Jensen's inequality, and as shown in Burda et al. (2016), we know that the IWAE ELBO is an upper bound on the VAE ELBO: L_IWAE(q) ≥ L_VAE(q). Finally, the log marginal likelihood can be decomposed as log(p(x)) = L_VAE(q) + KL(q || p), and rearranged to KL(q || p) = log(p(x)) − L_VAE(q). Following the observations above and substituting q_IW for q:

$$KL(q_{IW} \,\|\, p) = \log(p(x)) - \mathcal{L}_{VAE}(q_{IW}) \qquad (18)$$
$$= \log(p(x)) - \mathcal{L}_{IWAE}(q) \qquad (19)$$
$$\leq \log(p(x)) - \mathcal{L}_{VAE}(q) = KL(q \,\|\, p) \qquad (20)$$

Thus KL(q_IW || p) ≤ KL(q || p), meaning that q_IW is closer to the true posterior than q in terms of KL divergence.

7 VISUALIZING q_IW IN 1D

We can look at the intermediate variational distributions for different numbers of samples k in one dimension. Fig. 4 demonstrates how the approximate posterior approaches the true posterior as k increases.

Figure 4: Visualization of the importance-weighted posterior. The blue distribution is the intractable distribution that we are trying to approximate. The green distribution is the variational distribution. The variational distributions in panels a, b, and c were optimized via SVI, whereas those in d, e, and f were optimized via SVI with the IWAE ELBO. The red histograms are importance-weighted samples from the variational distribution.
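The red histograms in Fig. 4 can be recreated in a few lines: draw samples from the variational distribution, self-normalize their importance weights, and build a weighted histogram. Below is a minimal numpy sketch, again with an assumed Gaussian q and mixture target rather than the fitted distributions shown in the figure.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_gaussian(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(z):
    # Stand-in for the intractable target (the blue curve in Fig. 4).
    return np.logaddexp(log_gaussian(z, -2.0, 0.5), log_gaussian(z, 1.5, 0.8)) - np.log(2.0)

def log_q(z):
    # Stand-in for the fitted variational distribution (the green curve in Fig. 4).
    return log_gaussian(z, 0.0, 2.0)

# Importance-weighted samples from q (the red histograms in Fig. 4): draw from q,
# self-normalize the weights p(x,z)/q(z|x), and histogram with those weights.
z = rng.normal(0.0, 2.0, size=100_000)
log_w = log_joint(z) - log_q(z)
w = np.exp(log_w - np.logaddexp.reduce(log_w))

hist, edges = np.histogram(z, bins=60, range=(-4.0, 4.0), weights=w, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Spot-check a few bins against the true posterior density.
for c, h in zip(centers[::10], hist[::10]):
    print(f"z = {c:+.2f}   weighted histogram {h:.3f}   true posterior {np.exp(log_joint(c)):.3f}")
```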