arxiv: v2 [stat.ml] 15 Aug 2017

Similar documents
REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS

Inference Suboptimality in Variational Autoencoders

INFERENCE SUBOPTIMALITY

Sandwiching the marginal likelihood using bidirectional Monte Carlo. Roger Grosse

Nonparametric Inference for Auto-Encoding Variational Bayes

Variational Autoencoder

Training Deep Generative Models: Variations on a Theme

Natural Gradients via the Variational Predictive Distribution

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

DEBIASING EVIDENCE APPROXIMATIONS: ON IMPORTANCE-WEIGHTED AUTOENCODERS AND JACKKNIFE VARIATIONAL INFERENCE

Generative models for missing value completion

arxiv: v1 [stat.ml] 6 Dec 2018

The Success of Deep Generative Models

Auto-Encoding Variational Bayes

Approximate Bayesian inference

Denoising Criterion for Variational Auto-Encoding Framework

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Tighter Variational Bounds are Not Necessarily Better

13: Variational inference II

17 : Optimization and Monte Carlo Methods

LDA with Amortized Inference

Bayesian Semi-supervised Learning with Deep Generative Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Resampled Proposal Distributions for Variational Inference and Learning

Resampled Proposal Distributions for Variational Inference and Learning

Unsupervised Learning

Variational Autoencoders

Note 1: Varitional Methods for Latent Dirichlet Allocation

arxiv: v1 [cs.lg] 6 Dec 2018

Variational AutoEncoder: An Introduction and Recent Perspectives

GENERATIVE ADVERSARIAL LEARNING

Deep Generative Models

An Information Theoretic Interpretation of Variational Inference based on the MDL Principle and the Bits-Back Coding Scheme

Variational Inference for Monte Carlo Objectives

Probabilistic Graphical Models

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Variational Sequential Monte Carlo

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Variational Autoencoders (VAEs)

But if z is conditioned on, we need to model it:

CLOSE-TO-CLEAN REGULARIZATION RELATES

B PROOF OF LEMMA 1. Published as a conference paper at ICLR 2018

Variational Bayes on Monte Carlo Steroids

Deep latent variable models

An Introduction to Expectation-Maximization

Searching for the Principles of Reasoning and Intelligence

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Fault Detection and Diagnosis Using Information Measures

Variational Inference (11/04/13)

UvA-DARE (Digital Academic Repository) Improving Variational Auto-Encoders using Householder Flow Tomczak, J.M.; Welling, M. Link to publication

Machine Learning Lecture Notes

arxiv: v4 [stat.co] 19 May 2015

Variables which are always unobserved are called latent variables or sometimes hidden variables. e.g. given y,x fit the model p(y x) = z p(y x,z)p(z)

Variational Inference via χ Upper Bound Minimization

Markov Chain Monte Carlo and Variational Inference: Bridging the Gap

arxiv: v3 [stat.ml] 30 Jun 2017

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University

Lecture 13 : Variational Inference: Mean Field Approximation

Quasi-Monte Carlo Flows

doubly semi-implicit variational inference

Variance Reduction in Black-box Variational Inference by Adaptive Importance Sampling

Adversarial Sequential Monte Carlo

Mixture Models and Expectation-Maximization

Variational Inference with Stein Mixtures

Gray-box Inference for Structured Gaussian Process Models: Supplementary Material

Gaussian Mixture Model

arxiv: v3 [stat.ml] 3 Apr 2017

Gradient Estimation Using Stochastic Computation Graphs

CSC321 Lecture 20: Autoencoders

Latent Variable Models

Introduction to Machine Learning

STA 4273H: Statistical Machine Learning

Technical Details about the Expectation Maximization (EM) Algorithm

Expectation Propagation in Dynamical Systems

Improving Variational Inference with Inverse Autoregressive Flow

Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

arxiv: v5 [cs.lg] 11 Jul 2018

Black-box α-divergence Minimization

arxiv: v2 [cs.ne] 22 Feb 2013

RAO-BLACKWELLIZED PARTICLE FILTER FOR MARKOV MODULATED NONLINEARDYNAMIC SYSTEMS

Image recognition based on hidden Markov eigen-image models using variational Bayesian method

Operator Variational Inference

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

Variational Inference via Stochastic Backpropagation

On the Quantitative Analysis of Decoder-Based Generative Models

arxiv:submit/ [stat.ml] 12 Nov 2017

Lecture 2: Correlated Topic Model

REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models

Sequential Monte Carlo in the machine learning toolbox

Notes on pseudo-marginal methods, variational Bayes and ABC

Bayesian Deep Learning

arxiv: v1 [cs.lg] 15 Jun 2016

Variational Auto Encoders

Variational Inference with Copula Augmentation

arxiv: v3 [stat.ml] 15 Mar 2018

Facilitating Bayesian Continual Learning by Natural Gradients and Stein Gradients

Log Gaussian Cox Processes. Chi Group Meeting February 23, 2016

Wild Variational Approximations

Transcription:

Worshop trac - ICLR 207 REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS Chris Cremer, Quaid Morris & David Duvenaud Department of Computer Science University of Toronto {ccremer,duvenaud}@cs.toronto.edu {quaid.morris}@utoronto.ca arxiv:704.0296v2 [stat.ml] 5 Aug 207 ABSTRACT The standard interpretation of importance-weighted autoencoders is that they maximie a tighter lower bound on the marginal lielihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimies the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualie the implicit importance-weighted distribution. BACKGROUND The importance-weighted autoencoder IWAE; Burda et al. 206)) is a variational inference strategy capable of producing arbitrarily tight evidence lower bounds. IWAE maximies the following multisample evidence lower bound ELBO): log px) E... q x) [ log i )] px, i ) L IW AE [q] IWAE ELBO) q i x) which is a tighter lower bound than the ELBO maximied by the variational autoencoder VAE; Kingma & Welling 204)): )] px, ) log px) E q x) [log L V AE [q]. VAE ELBO) q x) 2 DEFINING THE IMPLICIT DISTRIBUTION q IW In this section, we derive the implicit distribution that arises from importance sampling from a distribution p using q as a proposal distribution. Given a batch of samples 2... from q x), the following is the unnormalied importance-weighted distribution: q IW x, 2: ) px,) q x) px, j) j q j x) q x) px, ) px,) q x) + ) ) px, j) j2 q j x) True posterior 0 00 Figure : Approximations to a complex true distribution, defined via q EW. As grows, this approximation approaches the true distribution.

Worshop trac - ICLR 207 Here are some properties of the approximate IWAE posterior: When, q IW x, 2: ) equals q x). When >, the form of q IW x, 2: ) depends on the true posterior p x). As, E 2... [ q IW x, 2: )] approaches the true posterior p x) pointwise. See the appendix for details. Importantly, q IW x, 2: ) is dependent on the batch of samples 2.... See Fig. 3 in the appendix for a visualiation of q IW with different batches of 2.... 2. RECOVERING THE IWAE BOUND FROM THE VAE BOUND Here we show that the IWAE ELBO is equivalent to the VAE ELBO in expectation, but with a more flexible, unnormalied q IW distribution, implicitly defined by importance reweighting. If we replace q x) with q IW x, 2: ) and tae an expectation over 2..., then we recover the IWAE ELBO: [ ) ] px, ) E 2... q x) [L V AE [ q IW 2: )]] E 2... q x) q IW 2: ) log d q IW x, 2: ) [ ) ] px, i ) E 2... q x) q IW 2: ) log d q i i x) [ )] px, i ) E... q x) log L IW AE [q] q i x) For a more detailed derivation, see the appendix. Note that we are abusing the VAE lower bound notation because this implies an expectation over an unnormalied distribution. Consequently, we replace the expectation with an equivalent integral. i 2.2 EXPECTED IMPORTANCE WEIGHTED DISTRIBUTION q EW We can achieve a tighter lower bound than L IW AE [q] by taing the expectation over 2... of q IW. The expected importance-weighted distribution q EW x) is a distribution given by: q EW x) E 2... q x) [ q IW x, 2: )] E 2... px, ) q x) px,) q x) + ) px, j) j2 q j x) 2) See section 5.2 for a proof that q EW is a normalied distribution. Using q EW in the VAE ELBO, L V AE [q EW ], results in an upper bound of L IW AE [q]. See section 5.3 for the proof, which is a special case of the proof in Naesseth et al. 207). The procedure to sample from q EW x) is shown in Algorithm. It is equivalent to sampling-importance-resampling SIR). 2.3 VISUALIZING THE NONPARAMETERIC APPROXIMATE POSTERIOR The IWAE approximating distribution is nonparametric in the sense that, as the true posterior grows more complex, so does the shape of q IW and q EW. This maes plotting these distributions challenging. A ernel-density-estimation approach could be used, but requires many samples. Thanfully, equations ) and 2) give us a simple and fast way to approximately plot q IW and q EW without introducing artifacts due to ernel density smoothing. Figure visualies q EW on a 2D distribution approximation problem using Algorithm 2. The base distribution q is a Gaussian. As we increase the number of samples and eep the base distribution fixed, we see that the approximation approaches the true distribution. See section 5.6 for D visualiations of q IW and q EW with 2. 3 RESAMPLING FOR PREDICTION During training, we sample the q distribution and implicitly weight them with the IWAE ELBO. After training, we need to explicitly reweight samples from q. 2

Worshop trac - ICLR 207 Algorithm Sampling q EW x) : number of importance samples 2: for i in... do 3: i q x) 4: w i px,i) q i x) 5: Each w i w i / i w i 6: j Categorical w) 7: Return j Algorithm 2 Plotting q EW x) : number of importance samples 2: S number of function samples 3: L locations to plot 4: ˆf eros L ) 5: for s in...s do 6: 2... q x) 7: ˆpx) i2 8: for in L do px, i) q i x) 9: ˆf[] + px,) 0: Return ˆf/S px,) +ˆpx)) q x) Real Sample q x) Sample q EW x) Figure 2: Reconstructions of MNIST samples from q x) and q EW. The model was trained by maximiing the IWAE ELBO with K50 and 2 latent dimensions. The reconstructions from q x) are greatly improved with the sampling-resampling step of q EW. In figure 2, we demonstrate the need to sample from q EW rather than q x) for reconstructing MNIST digits. We trained the model to maximie the IWAE ELBO with K50 and 2 latent dimensions, similar to Appendix C in Burda et al. 206). When we sample from q x) and reconstruct the samples, we see a number of anomalies. However, if we perform the sampling-resampling step Alg. ), then the reconstructions are much more accurate. The intuition here is that we trained the model with q EW with K 50 then sampled from q x) q EW with K ), which are very different distributions, as seen in Fig.. 4 DISCUSSION Bachman & Precup 205) also showed that the IWAE objective is equivalent to stochastic variational inference with a proposal distribution corrected towards the true posterior via normalied importance sampling. We build on this idea by further examining q IW and by providing visualiations to help better grasp the interpretation. To summarie our observations, the following is the ordering of lower bounds given specific proposal distributions, log px) L V AE [q EW ] E 2... q x) [L V AE [ q IW 2: )]] L IW AE [q] L V AE [q] In light of this, IWAE can be seen as increasing the complexity of the approximate distribution q, similar to other methods that increase the complexity of q, such as Normaliing Flows Jimene Reende & Mohamed, 205), Variational Boosting Miller et al., 206) or Hamiltonian variational inference Salimans et al., 205). With this interpretation in mind, we can possibly generalie q IW to be applicable to other divergence measures. An interesting avenue of future wor is the comparison of IW-based variational families with alpha-divergences or operator variational objectives. 3

Worshop trac - ICLR 207 ACKNOWLEDGMENTS We d lie to than an anonymous ICLR reviewer for providing insightful future directions for this wor. We d lie to than Yuri Burda, Christian Naesseth, and Scott Linderman for bringing our attention to oversights in the paper. We d also lie to than Christian Naesseth for the derivation in section 5.3 and for providing many helpful comments. REFERENCES Philip Bachman and Doina Precup. Training Deep Generative Models: Variations on a Theme. NIPS Approximate Inference Worshop, 205. Yuri Burda, Roger Grosse, and Ruslan Salahutdinov. Importance weighted autoencoders. In ICLR, 206. Danilo Jimene Reende and Shair Mohamed. Variational Inference with Normaliing Flows. In ICML, 205. Diederi P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 204. Andrew C. Miller, Nicholas Foti, and Ryan P. Adams. Variational Boosting: Iteratively Refining Posterior Approximations. Advances in Approximate Bayesian Inference, NIPS Worshop, 206. C. A. Naesseth, S. W. Linderman, R. Ranganath, and D. M. Blei. Variational Sequential Monte Carlo. ArXiv preprint, 207. Tim Salimans, Diederi P. Kingma, and Max Welling. Marov chain monte carlo and variational inference: Bridging the gap. In ICML, 205. 4

Worshop trac - ICLR 207 5 APPENDIX 5. DETAILED DERIVATION OF THE EQUIVALENCE OF VAE AND IWAE BOUND Here we show that the expectation over 2... of the VAE lower bound with the unnomalied importance-weighted distribution q IW, L VAE [ q IW 2: )], is equivalent to the IWAE bound with the original q distribution, L IWAE [q]. E 2... q x)[l VAE [ q IW 2: )]] E 2... q x) [ E 2... q x) E 2... q x) E 2... q x) [ [ E 2... q x) E 2... q x) E... q x) E... q x) E... q x) q IW x, 2: ) log q IW x, 2: ) log q IW x, 2: ) log q IW x, 2: ) log px,) q x) px, j) j q j x) j px, ) q x) px, j) q j x) px, ) q x) px, j) j q j x) px, j) j q j x) px, j) j q j x) [ log i log log px, i ) q i x) px, ) q IW x, 2: ) px, ) px,) px, i ) i q i x) i ) ] d 3) d ) px, i ) d q i x) 4) ] 5) ) ] px, i ) d q i x) i q x) log q x) log )] i i i i 6) px, i ) q i x) 7) px, i ) q i x) 8) ) px, i ) 9) q i x) px, i ) q i x) ) 0) ) L IWAE [q] 2) 8): Change of notation. 0): i has the same expectation as so we can replace with the sum of terms. ) d ) d 5

Worshop trac - ICLR 207 5.2 PROOF THAT q EW IS A NORMALIZED DISTRIBUTION q EW x)d E 2... q x) [ q IW x, 2: )] d 3) E 2... px, ) q x) px,) q x) + ) d 4) px, j) j2 q j x) q x) q x) E 2... px, ) q x) px,) q x) + ) d 5) px, j) j2 q j x) E q x) E 2... q x) E... q x) E... q x) i E... q x) px, ) q x) j i E... q x) j px,) px,) q x) q x) + px, j) q j x) px, ) q x) px, j) j q j x) px, i) q i x) px, j) j q j x) px, i) q i x) px, j) q j x) j2 px, j) q j x) ) 6) ) 7) 8) 9) 20) E... q x) [] 2) 22) 7): Change of notation. 9): i has the same expectation as so we can replace with the sum of terms. 20): Linearity of expectation. 6

Worshop trac - ICLR 207 5.3 PROOF THAT L V AE [q EW ] IS AN UPPER BOUND OF L IW AE [q] Proof provided by Christian Naesseth. Let ˆpx : ) px,) q x) + j2 px, j) q j x) )] px, ) L V AE [q EW ] E qew [log q EW x) E qew log px, ) E qew log ) E q2: x) E q2: x) [ px,) ˆpx : ) [ ˆpx : ) 23) ] 24) ] 25) [ E qew log Eq2: x) [ˆpx : ) ])] 26) px, )E q2: x) [ˆpx : ) ] log E q2: x) [ˆpx : ) ]) d 27) px, )E q2: x) [ˆpx : ) log ˆpx : ) )] d 28) px, ) q 2: x)ˆpx : ) log ˆpx : ) ) d 29) 2: q x) : q x) px, )q 2: x)ˆpx : ) log ˆpx : ) ) d 30) px, ) : q x) q : x)ˆpx : ) log ˆpx : ) ) d 3) px, ) q x) : ˆpx : ) q : x)log ˆpx : )) d 32) px, ) q x) q px, : x)log px, j ) d 33) j) : q j q j x) j j x) px, i) q i x) q px, : x)log px, j ) d 34) j) i : q j q j x) j j x) px, i) i q i x) q px, : x)log px, j ) d 35) j) : q j q j x) j j x) q : x)log px, j ) d 36) : q j j x) E q: ) log px, j ) 37) q j x) j L IW AE [q] 38) 28): Given that fa) AlogA is concave for A > 0, and fe[x]) E[fx)], then fe[x]) E[x]logE[x] E[ xlogx]. 30): Change of notation. 34): i has the same expectation as so we can replace with the sum of terms. 7

Worshop trac - ICLR 207 5.4 PROOF THAT q EW IS CLOSER TO THE TRUE POSTERIOR THAN q The previous section showed that L IW AE q) L V AE q EW ). That is, the IWAE ELBO with the base q is a lower bound to the VAE ELBO with the importance weighted q EW. Due to Jensen's inequality and as shown in Burda et al. 206), we now that the IWAE ELBO is an upper bound of the VAE ELBO: L IW AE q) L V AE q). Furthermore, the log marginal lielihood can be factoried into: logpx)) L V AE q) + KLq p), and rearranged to: KLq p) logpx)) L V AE q). Following the observations above and substituting q EW for q: KLq EW p) logpx)) L V AE [q EW ] 39) logpx)) L IW AE [q] 40) logpx)) L V AE [q] KLq p) 4) Thus, KLq EW p) KLq p), meaning q EW is closer to the true posterior than q in terms of KL divergence. 5.5 IN THE LIMIT OF THE NUMBER OF SAMPLES Another perspective is in the limit of. Recall that the marginal lielihood can be approximated by importance sampling: [ ] px, ) px) E q x) px, i ) 42) q x) q i x) where i is sampled from q x). We see that the denominator of q IW is approximating px). If px,) q x) is bounded, then it follows from the strong law of large numbers that, as approaches infinity, q IW converges to the true posterior p x) almost surely. This interpretation becomes clearer when we factor out the true posterior from q IW : q IW x, 2: ) i px) px,) q x) + )p x) 43) px, j) j2 q j x) We see that the closer the denominator becomes to px), the closer q IW is to the true posterior. 5.6 VISUALIZING q IW AND q EW IN D Figure 3: Visualiation of D q IW and q EW distributions. The blue p) distribution and the green q) distribution are both normalied. The three instances of q IW 2 ) 2) have different 2 samples from q) and we can see that they are unnormalied. q EW ) is normalied and is the expectation over 30 q IW 2 ) distributions. The q IW distributions were plotted using Algorithm 2 with S. 8