Sandwiching the marginal likelihood using bidirectional Monte Carlo. Roger Grosse


Sandwiching the marginal likelihood using bidirectional Monte Carlo. Roger Grosse, Ryan Adams, Zoubin Ghahramani

Introduction. When comparing different statistical models, we'd like a quantitative criterion that trades off model complexity against fit to the data. In a Bayesian setting, we often use the marginal likelihood, defined as the probability of the data with all parameters and latent variables integrated out. Motivation: plug it into Bayes' rule, $p(M_i \mid D) = \frac{p(M_i)\, p(D \mid M_i)}{\sum_j p(M_j)\, p(D \mid M_j)}$.

Introduction: marginal likelihood. [Figure: a matrix decomposition model in which the data matrix is expressed in terms of several component matrices.] We need to integrate out all of the component matrices and their hyperparameters.

Introduction. Advantages of the marginal likelihood (ML): it accounts for model complexity in a sophisticated way, it is closely related to description length, and it measures the model's ability to generalize to unseen examples. ML is used in those rare cases where it is tractable, e.g. Gaussian processes and fully observed Bayes nets. Unfortunately, it is typically very hard to compute because it requires a very high-dimensional integral. While ML has been criticized on many fronts, the proposed alternatives pose similar computational difficulties.

Introduction. Focus on latent variable models with parameters $\theta$, latent variables $z$, and observations $y$; assume i.i.d. observations. The marginal likelihood requires summing or integrating out the latent variables and parameters: $p(\mathbf{y}) = \int p(\theta) \prod_{i=1}^N \sum_{z_i} p(z_i \mid \theta)\, p(y_i \mid z_i, \theta)\, d\theta$. This is similar to computing a partition function $Z = \sum_{x \in \mathcal{X}} f(x)$.
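To make the formula concrete, here is a hypothetical toy example (not from the talk): a two-component Bernoulli mixture whose mixing weight theta is restricted to a coarse grid, so the sum over latents and the integral over parameters can be brute-forced. The names and values are illustrative only; the point is that this enumeration scales exponentially and is hopeless for realistic models.

```python
import numpy as np

def marginal_likelihood(y, theta_grid, theta_prior, mu=(0.2, 0.8)):
    """Brute-force p(y) = sum_theta p(theta) prod_i sum_{z_i} p(z_i|theta) p(y_i|z_i, theta)."""
    total = 0.0
    for theta, p_theta in zip(theta_grid, theta_prior):
        per_point = np.zeros(len(y))
        for z, p_z in [(0, 1.0 - theta), (1, theta)]:      # sum over the latent z_i
            lik = mu[z] ** y * (1.0 - mu[z]) ** (1 - y)    # Bernoulli p(y_i | z_i, theta)
            per_point += p_z * lik
        total += p_theta * np.prod(per_point)              # i.i.d. product over observations
    return total

y = np.array([1, 0, 1, 1, 0])                              # a tiny binary dataset
theta_grid = np.linspace(0.05, 0.95, 19)                   # discretized parameter
theta_prior = np.ones_like(theta_grid) / len(theta_grid)   # uniform prior on the grid
print(marginal_likelihood(y, theta_grid, theta_prior))
```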

Introduction. Problem: exact marginal likelihood computation is intractable. There are many algorithms to approximate it, but we don't know how well they work.

Why evaluating ML estimators is hard The answer to life, the universe, and everything is... 42

Why evaluating ML estimators is hard. The marginal likelihood is $\log p(D) = 23814.7$

Why evaluating ML estimators is hard. How does one deal with this in practice? Polynomial-time approximations exist for partition functions of ferromagnetic Ising models; one can test on very small instances which can be solved exactly; or one can run a bunch of estimators and see if they agree with each other.

Log-ML lower bounds. One marginal likelihood estimator is simple importance sampling: draw $\{\theta^{(k)}, z^{(k)}\}_{k=1}^K \sim q$ and compute $\hat p(D) = \frac{1}{K} \sum_{k=1}^K \frac{p(\theta^{(k)}, z^{(k)}, D)}{q(\theta^{(k)}, z^{(k)})}$. This is an unbiased estimator: $\mathbb{E}[\hat p(D)] = p(D)$. Unbiased estimators are stochastic lower bounds: by Jensen's inequality, $\mathbb{E}[\log \hat p(D)] \le \log p(D)$, and by Markov's inequality, $\Pr(\log \hat p(D) > \log p(D) + b) \le e^{-b}$. Many widely used algorithms have the same property!
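As an illustration (not from the slides), here is a minimal numpy/scipy sketch of simple importance sampling on a toy conjugate model, $\theta \sim \mathcal{N}(0,1)$ and $D \mid \theta \sim \mathcal{N}(\theta, 1)$, where the true marginal likelihood is $\mathcal{N}(D; 0, 2)$. Averaging the log of the estimator shows the stochastic-lower-bound behaviour described above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = 1.3                                                        # a single observation
true_log_ml = norm.logpdf(D, loc=0.0, scale=np.sqrt(2.0))      # exact log p(D) for this toy model

def log_p_hat(K):
    theta = rng.standard_normal(K)                             # proposal q = prior
    log_w = norm.logpdf(D, loc=theta, scale=1.0)               # p(theta, D) / q(theta) = p(D | theta)
    return np.logaddexp.reduce(log_w) - np.log(K)              # log of (1/K) * sum of weights

estimates = np.array([log_p_hat(10) for _ in range(1000)])
print("true log p(D):     %.4f" % true_log_ml)
print("mean log p_hat(D): %.4f  (a stochastic lower bound)" % estimates.mean())
```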

Log-ML lower bounds. [Figure: stochastic lower bounds from several estimators compared with the true value: annealed importance sampling (AIS), sequential Monte Carlo (SMC), Chib-Murray-Salakhutdinov, and variational Bayes.]

How to obtain an upper bound? Harmonic mean estimator: draw $\{\theta^{(k)}, z^{(k)}\}_{k=1}^K \sim p(\theta, z \mid D)$ and compute $\hat p(D) = \frac{K}{\sum_{k=1}^K 1/p(D \mid \theta^{(k)}, z^{(k)})}$. This is equivalent to simple importance sampling, but with the roles of the proposal and target distributions reversed. It gives an unbiased estimate of the reciprocal of the ML, $\mathbb{E}\!\left[\frac{1}{\hat p(D)}\right] = \frac{1}{p(D)}$, and hence a stochastic upper bound on the log-ML. Caveat 1: it is only an upper bound if you sample exactly from the posterior, which is generally intractable. Caveat 2: this is the Worst Monte Carlo Estimator (Neal, 2008).
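For comparison, a hedged sketch of the harmonic mean estimator on the same toy model; because the posterior is $\mathcal{N}(D/2, 1/2)$ in closed form, the exact posterior samples the estimator requires are available here (which is exactly what is intractable in practice).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
D = 1.3
true_log_ml = norm.logpdf(D, loc=0.0, scale=np.sqrt(2.0))

def harmonic_mean_log_ml(K):
    theta = rng.normal(loc=D / 2.0, scale=np.sqrt(0.5), size=K)   # exact posterior samples
    log_lik = norm.logpdf(D, loc=theta, scale=1.0)
    return np.log(K) - np.logaddexp.reduce(-log_lik)              # log( K / sum_k 1/p(D|theta_k) )

estimates = np.array([harmonic_mean_log_ml(10) for _ in range(1000)])
print("true log p(D):               %.4f" % true_log_ml)
print("mean harmonic-mean estimate: %.4f  (a stochastic upper bound)" % estimates.mean())
```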

Annealed importance sampling (Neal, 2001). [Figure: a sequence of distributions $p_0, p_1, p_2, \dots, p_{K-1}, p_K$ interpolating between a tractable initial distribution (e.g. the prior) and the intractable target distribution (e.g. the posterior).]

Annealed importance sampling (Neal, 2001). Given: unnormalized distributions $f_0, \dots, f_K$ and MCMC transition operators $T_0, \dots, T_K$, where $f_0$ is easy to sample from and to compute the partition function of. Initialize $x \sim f_0$ and $w = 1$. For $i = 0, \dots, K-1$: set $w := w \cdot \frac{f_{i+1}(x)}{f_i(x)}$, then update $x$ using $T_{i+1}$. Then $\mathbb{E}[w] = \frac{Z_K}{Z_0}$, and $\hat Z_K = \frac{Z_0}{S} \sum_{s=1}^S w^{(s)}$.
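A minimal sketch of this recipe on the same toy Gaussian model, annealing along geometric averages $f_\beta(\theta) = p(\theta)\, p(D \mid \theta)^\beta$ and using a random-walk Metropolis update as the transition operator (an illustrative choice; practical applications use more sophisticated operators). Since the prior is normalized, $Z_0 = 1$ and the estimate is of $\log p(D)$ directly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
D = 1.3
true_log_ml = norm.logpdf(D, loc=0.0, scale=np.sqrt(2.0))

def log_f(theta, beta):
    """Unnormalized intermediate distribution f_beta(theta) = p(theta) p(D|theta)^beta."""
    return norm.logpdf(theta, 0.0, 1.0) + beta * norm.logpdf(D, theta, 1.0)

def ais_log_weight(betas, step=0.5):
    theta = rng.standard_normal()                       # exact sample from f_0 = prior
    log_w = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        log_w += log_f(theta, b_next) - log_f(theta, b_prev)
        prop = theta + step * rng.standard_normal()     # Metropolis step targeting f_{b_next}
        if np.log(rng.random()) < log_f(prop, b_next) - log_f(theta, b_next):
            theta = prop
    return log_w

betas = np.linspace(0.0, 1.0, 101)                      # K = 100 intermediate distributions
log_w = np.array([ais_log_weight(betas) for _ in range(200)])
log_ml_est = np.logaddexp.reduce(log_w) - np.log(len(log_w))
print("true log p(D): %.4f   AIS estimate: %.4f" % (true_log_ml, log_ml_est))
```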

Annealed importance sampling (Neal, 2001). [Diagram: states $x_0, x_1, \dots, x_K$ produced by transitions $T_1, \dots, T_K$, passing through the distributions $p_0, p_1, \dots, p_K$.] Forward: $w := \prod_{i=1}^K \frac{f_i(x_{i-1})}{f_{i-1}(x_{i-1})} = \frac{Z_K}{Z_0}\,\frac{q_{\mathrm{back}}(x_0, x_1, \dots, x_K)}{q_{\mathrm{fwd}}(x_0, x_1, \dots, x_K)}$, so $\mathbb{E}[w] = \frac{Z_K}{Z_0}$. Backward: $w := \prod_{i=1}^K \frac{f_{i-1}(x_i)}{f_i(x_i)} = \frac{Z_0}{Z_K}\,\frac{q_{\mathrm{fwd}}(x_0, x_1, \dots, x_K)}{q_{\mathrm{back}}(x_0, x_1, \dots, x_K)}$, so $\mathbb{E}[w] = \frac{Z_0}{Z_K}$.
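For completeness, a short derivation sketch of why the forward weight is unbiased for $Z_K/Z_0$, using the standard AIS trajectory distributions ($q_{\mathrm{fwd}}$ generated by the forward chain, $q_{\mathrm{back}}$ by the reversed chain through the reversals of the $T_i$):

```latex
% q_fwd(x_{0:K})  = p_0(x_0) T_1(x_1 | x_0) \cdots T_K(x_K | x_{K-1})
% q_back(x_{0:K}) = p_K(x_K) \tilde T_K(x_{K-1} | x_K) \cdots \tilde T_1(x_0 | x_1),
% where \tilde T_i is the reversal of T_i with respect to p_i.
\begin{aligned}
\mathbb{E}_{q_{\mathrm{fwd}}}[w]
  &= \int q_{\mathrm{fwd}}(x_{0:K}) \,
       \frac{Z_K}{Z_0} \,
       \frac{q_{\mathrm{back}}(x_{0:K})}{q_{\mathrm{fwd}}(x_{0:K})} \, dx_{0:K} \\
  &= \frac{Z_K}{Z_0} \int q_{\mathrm{back}}(x_{0:K}) \, dx_{0:K}
   = \frac{Z_K}{Z_0}.
\end{aligned}
```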

Bidirectional Monte Carlo. Initial distribution: the prior $p(\theta, z)$. Target distribution: the posterior $p(\theta, z \mid D) = \frac{p(\theta, z, D)}{p(D)}$. Partition function: $Z = \int p(\theta, z, D)\, d\theta\, dz = p(D)$. Forward chain: $\mathbb{E}[w] = \frac{Z_K}{Z_0} = p(D)$, a stochastic lower bound. Backward chain (requires an exact posterior sample!): $\mathbb{E}[w] = \frac{Z_0}{Z_K} = \frac{1}{p(D)}$, a stochastic upper bound.

Bidirectional Monte Carlo. How do we get an exact sample? There are two ways to sample from $p(\theta, z, D)$: forward sample $\theta, z \sim p(\theta, z)$ and then $D \sim p(D \mid \theta, z)$; or generate data and then perform inference, $D \sim p(D)$ followed by $\theta, z \sim p(\theta, z \mid D)$. These define the same joint distribution, so the parameters and latent variables used to generate the data are an exact posterior sample!

Bidirectional Monte Carlo. Summary of the algorithm: simulate $\theta^*, z^* \sim p(\theta, z)$ and $\mathbf{y} \sim p(\mathbf{y} \mid \theta^*, z^*)$. Obtain a stochastic lower bound on $\log p(\mathbf{y})$ by running AIS forwards. Obtain a stochastic upper bound on $\log p(\mathbf{y})$ by running AIS backwards, starting from $(\theta^*, z^*)$. The two bounds will converge given enough intermediate distributions.
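An end-to-end sketch of bidirectional Monte Carlo on the toy Gaussian model used above (latent variables $z$ are omitted for brevity, so only $\theta$ is integrated out; all settings are illustrative). The data are simulated from the model, so the $\theta^*$ that generated them is an exact posterior sample and can seed the backward chain.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
theta_star = rng.standard_normal()                 # theta* ~ p(theta)
D = theta_star + rng.standard_normal()             # D ~ p(D | theta*): simulated data
true_log_ml = norm.logpdf(D, 0.0, np.sqrt(2.0))    # available only because the model is tiny

def log_f(theta, beta):
    return norm.logpdf(theta, 0.0, 1.0) + beta * norm.logpdf(D, theta, 1.0)

def ais_chain(betas, theta, step=0.5):
    """Accumulate the AIS log-weight along the given beta schedule, starting at theta."""
    log_w = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        log_w += log_f(theta, b_next) - log_f(theta, b_prev)
        prop = theta + step * rng.standard_normal()
        if np.log(rng.random()) < log_f(prop, b_next) - log_f(theta, b_next):
            theta = prop
    return log_w

betas = np.linspace(0.0, 1.0, 101)
S = 200
fwd = np.array([ais_chain(betas, rng.standard_normal()) for _ in range(S)])   # prior -> posterior
bwd = np.array([ais_chain(betas[::-1], theta_star) for _ in range(S)])        # posterior -> prior

lower = np.logaddexp.reduce(fwd) - np.log(S)       # E[w_fwd] = p(D)   -> stochastic lower bound
upper = -(np.logaddexp.reduce(bwd) - np.log(S))    # E[w_bwd] = 1/p(D) -> stochastic upper bound
print("lower %.4f <= true %.4f <= upper %.4f" % (lower, true_log_ml, upper))
```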

Experiments. BDMC lets us compute ground-truth log-ML values for data simulated from a model, and we can use these ground-truth values to benchmark log-ML estimators! We obtained ground-truth log-ML values on simulated data for clustering, low rank approximation, and binary attribute models, and compared a wide variety of ML estimators, with MCMC operators shared between all algorithms wherever possible.

Results: binary attributes. [Figure: log-ML estimates vs. the true value for the harmonic mean estimator, the Bayesian information criterion (BIC), and likelihood weighting.]

Results: binary attributes. [Figure: log-ML estimates vs. the true value for variational Bayes and Chib-Murray-Salakhutdinov.]

Results: binary attributes (zoomed in). [Figure: log-ML estimates vs. the true value for reverse AIS, reverse SMC, nested sampling, annealed importance sampling (AIS), and sequential Monte Carlo.]

Results: binary attributes. Which estimators give accurate results? [Figure: mean squared error vs. time (seconds) for likelihood weighting, the harmonic mean estimator, variational Bayes, nested sampling, Chib-Murray-Salakhutdinov, AIS, and sequential Monte Carlo (SMC), alongside the accuracy needed to distinguish simple matrix factorizations.]

Results: low rank approximation. [Figure: results for annealed importance sampling (AIS).]

Recommendations. Try AIS first. If AIS is too slow, try sequential Monte Carlo or nested sampling. You can't fix a bad algorithm by averaging many samples. Don't trust naive confidence intervals; estimators need to be evaluated rigorously.

On the quantitative evaluation of decoder-based generative models. Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov

Decoder-based generative models. Define a generative process: sample latent variables $z$ from a simple (fixed) prior $p(z)$, then pass them through a decoder network to get $x = f(z)$. Examples: variational autoencoders (Kingma and Welling, 2014), generative adversarial networks (Goodfellow et al., 2014), generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015), and nonlinear independent components estimation (Dinh et al., 2015).
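A tiny illustrative decoder in numpy, just to pin down the generative process described above: $z$ is drawn from the prior $\mathcal{N}(0, I)$ and pushed through a deterministic network $f$. The two-layer architecture, sizes, and random weights are made up for the example; they are not taken from any of the cited models.

```python
import numpy as np

rng = np.random.default_rng(0)
k, h, d = 10, 64, 784                          # latent dim, hidden dim, data dim (illustrative)
W1, b1 = 0.1 * rng.standard_normal((h, k)), np.zeros(h)
W2, b2 = 0.1 * rng.standard_normal((d, h)), np.zeros(d)

def decoder(z):
    """x = f(z): a small MLP with a tanh hidden layer and sigmoid outputs."""
    hidden = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))

z = rng.standard_normal(k)                     # sample latent variables from the prior p(z)
x = decoder(z)                                 # deterministic pass through the decoder
print(x.shape)                                 # (784,)
```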

Decoder-based generative models. Variational autoencoder (VAE): train both a generator (decoder) and a recognition network (encoder), optimizing a variational lower bound on the log-likelihood. Generative adversarial network (GAN): train a generator (decoder) and a discriminator; the discriminator wants to distinguish model samples from the training data, while the generator wants to fool the discriminator. Generative moment matching network (GMMN): train a generative network such that certain statistics match between the generated samples and the data.

Decoder-based generative models. Some impressive-looking samples: [Figure: samples from Denton et al. (2015) and Radford et al. (2016).] But how well do these models capture the distribution?

Decoder-based generative models Looking at samples can be misleading:

Decoder-based generative models. [Figure: GAN samples. Panels: GAN, 10 dim (LLD = 328.7); GAN, 50 dim, 200 epochs (LLD = 543.5); GAN, 50 dim, 1000 epochs (LLD = 625.5).]

Evaluating decoder-based models. We want to quantitatively evaluate generative models in terms of the probability of held-out data. Problem: a GAN or GMMN with $k$ latent dimensions can only generate within a $k$-dimensional submanifold! Standard (but unsatisfying) solution: impose a spherical Gaussian observation model, $p_\sigma(x \mid z) = \mathcal{N}(x; f(z), \sigma^2 I)$, and tune $\sigma$ on a validation set. Problem: this still requires computing an intractable integral, $p_\sigma(x) = \int p(z)\, p_\sigma(x \mid z)\, dz$.

Evaluating decoder-based models. For some models, we can tractably compute log-likelihoods, or at least a reasonable lower bound. Models with reversible decoders (e.g. NICE) have tractable likelihoods. Variational autoencoders give the ELBO lower bound, $\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x)\,\|\,p(z))$, and the importance weighted autoencoder (IWAE) gives $\log p(x) \ge \mathbb{E}_{z^{(1)}, \dots, z^{(k)} \sim q(z \mid x)}\!\left[\log \frac{1}{k} \sum_{i=1}^k \frac{p(x, z^{(i)})}{q(z^{(i)} \mid x)}\right]$. In general, we don't have accurate and tractable bounds, and even in the cases of VAEs and IWAEs, we don't know how accurate the bounds are.
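A hedged sketch of the $k$-sample IWAE bound on a toy linear-Gaussian "VAE" where $\log p(x)$ is available in closed form, so the bound can be checked; the model, encoder, and parameter values are invented for the illustration. With $k = 1$ this reduces to the ELBO, and the bound tightens as $k$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w, sigma_x = 2.0, 0.5                          # z ~ N(0,1), x | z ~ N(w z, sigma_x^2)
x = 1.7
true_log_px = norm.logpdf(x, 0.0, np.sqrt(w**2 + sigma_x**2))   # exact log p(x)

def iwae_bound(x, k, mu_q, s_q):
    z = mu_q + s_q * rng.standard_normal(k)                     # z^(1..k) ~ q(z | x)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, w * z, sigma_x)
    log_q = norm.logpdf(z, mu_q, s_q)
    return np.logaddexp.reduce(log_joint - log_q) - np.log(k)   # log (1/k) sum_i p(x,z_i)/q(z_i|x)

mu_q, s_q = 0.5 * x, 0.5                                        # a deliberately imperfect encoder
for k in (1, 5, 50):
    est = np.mean([iwae_bound(x, k, mu_q, s_q) for _ in range(2000)])
    print("k = %3d   bound %.4f   (true log p(x) = %.4f)" % (k, est, true_log_px))
```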

Evaluating decoder-based models. Currently, results are reported using kernel density estimation (KDE): sample $z^{(1)}, \dots, z^{(S)} \sim p(z)$ and compute $\hat p_\sigma(x) = \frac{1}{S} \sum_{s=1}^S p_\sigma(x \mid z^{(s)})$. One can show this is a stochastic lower bound: $\mathbb{E}[\log \hat p_\sigma(x)] \le \log p_\sigma(x)$. It is unlikely to perform well in high dimensions, and the papers caution the reader not to trust the results.
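A minimal sketch of this KDE-style estimator: draw latents from the prior and average the Gaussian observation density around the decoded means. The decoder here is a stand-in linear map and $\sigma$ is assumed given; both are illustrative. With realistic dimensionalities the prior samples essentially never land near $x$, which is why this estimator badly underestimates the log-likelihood in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, S, sigma = 784, 50, 10000, 0.05          # data dim, latent dim, samples, observation noise

A = 0.1 * rng.standard_normal((d, k))
def decoder(z):                                # placeholder f(z); not a trained network
    return A @ z

def log_p_sigma(x, z):
    """log N(x; f(z), sigma^2 I), summed over dimensions."""
    diff = x - decoder(z)
    return -0.5 * np.sum(diff**2) / sigma**2 - 0.5 * d * np.log(2.0 * np.pi * sigma**2)

x = decoder(rng.standard_normal(k)) + sigma * rng.standard_normal(d)     # a test point
log_terms = np.array([log_p_sigma(x, rng.standard_normal(k)) for _ in range(S)])
kde_log_px = np.logaddexp.reduce(log_terms) - np.log(S)                  # stochastic lower bound
print(kde_log_px)
```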

Evaluating decoder-based models. Our approach: integrate out the latent variables using AIS, with Hamiltonian Monte Carlo (HMC) as the transition operator, and validate the accuracy of the estimates on simulated data using BDMC. Experiment details: real-valued MNIST dataset; VAEs, GANs, and GMMNs with decoder architectures 10-64-256-256-1024-784 and 50-1024-1024-1024-784; spherical Gaussian observations imposed on all models (including the VAE).
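For concreteness, a hedged sketch of a single HMC transition of the kind that can serve as the AIS transition operator here: leapfrog integration followed by a Metropolis accept step, for any target with log-density log_f and gradient grad_log_f (in this setting, the tempered target $p(z)\, p_\sigma(x \mid z)^\beta$). The step size and number of leapfrog steps are illustrative and do not reproduce the paper's settings.

```python
import numpy as np

def hmc_transition(z, log_f, grad_log_f, rng, step=0.05, n_leapfrog=10):
    """One HMC update leaving the distribution proportional to exp(log_f) invariant."""
    p0 = rng.standard_normal(z.shape)                 # resample momenta
    z_new, p = z.copy(), p0.copy()
    p = p + 0.5 * step * grad_log_f(z_new)            # half step for momentum
    for _ in range(n_leapfrog):
        z_new = z_new + step * p                      # full step for position
        p = p + step * grad_log_f(z_new)              # full step for momentum
    p = p - 0.5 * step * grad_log_f(z_new)            # correct the last momentum step to a half step
    log_accept = (log_f(z_new) - 0.5 * p @ p) - (log_f(z) - 0.5 * p0 @ p0)
    if np.log(rng.random()) < log_accept:             # Metropolis accept/reject
        return z_new
    return z
```

In the AIS sketch shown earlier, a call like this would replace the random-walk Metropolis step, with log_f and grad_log_f built from the current tempered distribution.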

How accurate are AIS and KDE? (GMMN-50) [Figure: AIS vs. KDE; log-likelihood vs. compute time in seconds for KDE, forward AIS, and backward AIS.]

How accurate is the IWAE bound? [Figure: AIS vs. IWAE; log-likelihood vs. compute time in seconds for IWAE, AIS, and AIS+encoder.]

Estimation of the variance parameter. [Figure: GAN-50 with varying variance; log-likelihood vs. the observation variance for Train AIS, Valid AIS, Train KDE, and Valid KDE.]

Comparison of different models. AIS estimates are accurate (small BDMC gap). A larger model gives much higher log-likelihood. VAEs achieve much higher log-likelihood than GANs and GMMNs. For GANs and GMMNs, there is no statistically significant difference between training and test log-likelihoods: these models are not just memorizing training examples.

Training curves for a GMMN. [Figure: GMMN-50 training curves; log-likelihood vs. number of epochs for Train AIS, Valid AIS, Train KDE, and Valid KDE.]

Training curves for a VAE. [Figure: VAE-50 training curves; log-likelihood vs. number of epochs for Train AIS, Valid AIS, Train IWAE, Valid IWAE, Train KDE, and Valid KDE.]

Missing modes. The GAN seriously misallocates probability mass between modes. [Figure: mode proportions at 200 epochs and 1000 epochs.] But this effect by itself is too small to explain why it underperforms the VAE by over 350 nats.

Missing modes. To see if the network is missing modes, let's visualize posterior samples given observations: use AIS to approximately sample $z$ from $p(z \mid x)$, then run the decoder. Using BDMC, we can validate the accuracy of AIS samples on simulated data.

Missing modes. [Figure: visualization of posterior samples for validation images; panels: data, GAN-10, VAE-10, GMMN-10, GAN-50, VAE-50, GMMN-50.]

Missing modes. [Figure: posterior samples on the training set; panels: data, GAN-10, VAE-10, GMMN-10, GAN-50, VAE-50, GMMN-50.]

Missing modes. Conjecture: the GAN acts like a frustrated student. [Figure: panels at 200 epochs and 1000 epochs.]

Conclusions. AIS gives high-accuracy log-likelihood estimates on MNIST (as validated by BDMC). This lets us observe interesting phenomena that are invisible to KDE: GANs and GMMNs are not just memorizing training examples; VAEs achieve substantially higher log-likelihoods than GANs and GMMNs, which appears to reflect a failure to model certain modes of the data distribution; recognition nets can overfit; and networks may continue to improve during training even if KDE estimates don't reflect that. It will be interesting to measure the effects of other algorithmic improvements to these networks.