Variational Inference for Monte Carlo Objectives

Variational Inference for Monte Carlo Objectives
Andriy Mnih, Danilo Rezende
June 21, 2016
1 / 18

Introduction
Variational training of directed generative models has been widely adopted. Results depend heavily on the choice of the variational posterior. A variational posterior that is too simple can prevent the model from using much of its capacity. Making it more expressive is one way to avoid this (see e.g. DRAW, normalizing flows). Simpler (orthogonal) alternative: optimize a tighter lower bound on the log-likelihood by throwing more computation at the problem. Burda et al. (2016) used the reparameterization trick to implement this approach in variational autoencoders. We develop a more general version that can also handle the harder case of models with discrete latent variables. We use the availability of multiple samples to implement highly effective variance reduction at virtually no additional cost. 2 / 18

Motivation
Continuous latent variables are not always appropriate. Some properties of the world, such as the absence/presence or number of objects, are fundamentally discrete. Dependencies between discrete latent variables can be easier to capture (e.g. DARN). Inference-based learning provides a principled way of training deep models without backpropagation. Inferring the latent variable values effectively makes them observed, breaking the flow of gradients through them. An inference network propagates the information contained in the target/output, which is captured by the backpropagated gradient in differentiable/reparameterized models. 3 / 18

Multi-sample objective for variational inference
The standard variational lower bound on $\log P_\theta(x)$ with a variational posterior $Q(h|x)$:
$$\mathcal{L}(x) = E_{Q(h|x)}\!\left[\log \frac{P(x,h)}{Q(h|x)}\right] = \log P(x) + E_{Q(h|x)}\!\left[\log \frac{P(h|x)}{Q(h|x)}\right].$$
There is a big penalty for having regions with $Q(h|x) \gg P(h|x)$. As a result, the learned $Q(h|x)$ provides very incomplete coverage of $P(h|x)$.
A tighter lower bound on $\log P_\theta(x)$ (IWAE, Burda et al., 2016):
$$\mathcal{L}^K(x) = E_{Q(h^{1:K}|x)}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} \frac{P(x,h^k)}{Q(h^k|x)}\right].$$
Now we have $K$ shots at hitting a high-probability region of $P(h|x)$. The learned $Q(h|x)$ is now less conservative. 4 / 18
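To make the gap between the single-sample bound and the multi-sample bound concrete, here is a minimal numpy sketch on a toy model with one Bernoulli latent variable. The model, the proposal $Q(h|x)$, and all numeric values are illustrative assumptions, not taken from the talk; the sketch only shows that the estimate of $\mathcal{L}^K(x)$ approaches $\log P(x)$ as $K$ grows.

```python
# Minimal numpy sketch: single-sample ELBO (K = 1) vs. the K-sample bound L^K(x)
# on a toy model with one Bernoulli latent h and a Bernoulli observation x.
# All distributions and numbers below are illustrative assumptions, not from the talk.
import numpy as np

rng = np.random.default_rng(0)

p_h = 0.3                         # prior P(h = 1)
p_x_given_h = {0: 0.2, 1: 0.9}    # likelihood P(x = 1 | h)
q_h = 0.5                         # variational posterior / proposal Q(h = 1 | x)

def log_joint(x, h):
    """log P(x, h) for the toy model."""
    lp_h = np.log(p_h) if h == 1 else np.log(1 - p_h)
    px = p_x_given_h[int(h)]
    lp_x = np.log(px) if x == 1 else np.log(1 - px)
    return lp_h + lp_x

def log_q(h):
    return np.log(q_h) if h == 1 else np.log(1 - q_h)

def bound_estimate(x, K):
    """One stochastic estimate of log (1/K) sum_k P(x, h^k) / Q(h^k | x)."""
    hs = rng.binomial(1, q_h, size=K)
    log_w = np.array([log_joint(x, h) - log_q(h) for h in hs])
    return np.logaddexp.reduce(log_w) - np.log(K)   # log-mean-exp of the weights

x = 1
true_log_px = np.log(p_h * p_x_given_h[1] + (1 - p_h) * p_x_given_h[0])
for K in (1, 5, 50):
    est = np.mean([bound_estimate(x, K) for _ in range(2000)])
    print(f"K={K:2d}  E[L^K(x)] ~ {est:.4f}   log P(x) = {true_log_px:.4f}")
```

Running this shows the K = 1 average (the ordinary ELBO) sitting below $\log P(x)$, with the gap shrinking as $K$ increases.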

Monte Carlo objectives
Generalization: objectives of the form
$$\mathcal{L}^K(x) = E_{Q(h^{1:K}|x)}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} f(x,h^k)\right],$$
where $h^1,\dots,h^K$ are independent samples from some distribution $Q(h|x)$.
Special case: if $f(x,h)$ is an unbiased Monte Carlo estimator of $P(x)$, then so is $\hat{I}(h^{1:K}) = \frac{1}{K}\sum_{k=1}^{K} f(x,h^k)$, and by Jensen's inequality $E_{Q(h^{1:K}|x)}\big[\log \hat{I}(h^{1:K})\big]$ is a lower bound on $\log P(x)$.
Can think of $\log \hat{I}(h^{1:K})$ as a stochastic lower bound on $\log P(x)$. The bound becomes tighter as $K$ increases, converging to $\log P(x)$ in the limit.
$Q(h|x)$ can be thought of as a proposal distribution as opposed to a variational posterior. 5 / 18
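As a minimal sketch of this general form (the helper name and the sanity check are illustrative assumptions, not from the talk): any log f(x, h) and any sampler for Q(h|x) give one stochastic evaluation of the objective. With log f = log P(x, h) - log Q(h|x) it recovers the importance-weighted bound from the previous slide.

```python
# General Monte Carlo objective, sketched: for any f(x, h) that is an unbiased
# estimator of P(x) under h ~ Q(h|x), log (1/K) sum_k f(x, h^k) is a stochastic
# lower bound on log P(x) (Jensen's inequality). Illustrative names only.
import numpy as np

def monte_carlo_objective(log_f, sample_q, x, K, rng):
    """One stochastic evaluation of log (1/K) sum_k f(x, h^k), h^k ~ Q(h|x) i.i.d."""
    log_f_vals = np.array([log_f(x, sample_q(x, rng)) for _ in range(K)])
    return np.logaddexp.reduce(log_f_vals) - np.log(K)

# Degenerate sanity check: if f(x, h) already equals P(x) exactly, the bound is
# tight for every K, since the log-mean of identical values is that value.
rng = np.random.default_rng(0)
log_px = np.log(0.41)                                  # made-up value of P(x)
val = monte_carlo_objective(log_f=lambda x, h: log_px,
                            sample_q=lambda x, r: None, x=None, K=10, rng=rng)
print(val, log_px)                                     # identical
```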

Examples of Monte Carlo objectives
The simplest case involves Monte Carlo sampling from the prior:
$$\mathcal{L}^K(x) = E_{P(h^{1:K})}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} P(x|h^k)\right] \quad \text{with } h^k \sim P(h).$$
Importance sampling with a learned proposal is usually much more efficient:
$$\mathcal{L}^K(x) = E_{Q(h^{1:K}|x)}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} \frac{P(x,h^k)}{Q(h^k|x)}\right].$$
Many other possibilities: can incorporate variance reduction techniques from importance sampling, such as control variates; can use $\alpha$-divergence based objectives. 6 / 18
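A small self-contained numerical comparison of the two objectives above, on a made-up Bernoulli model where the latent value that explains the observation is rare under the prior (all numbers are illustrative assumptions, not from the talk). Sampling from the prior rarely hits the relevant latent value, so its bound is much looser than the one obtained with a reasonably tuned proposal.

```python
# Toy comparison (illustrative numbers only): prior sampling vs. a learned proposal.
import numpy as np

rng = np.random.default_rng(0)
p_h = 0.05                         # prior P(h = 1): the "explaining" latent is rare
p_x1 = {0: 0.01, 1: 0.95}          # likelihood P(x = 1 | h)
q_h = 0.8                          # proposal Q(h = 1 | x = 1), standing in for a learned Q
log_px = np.log(p_h * p_x1[1] + (1 - p_h) * p_x1[0])   # exact log P(x = 1)

def log_mean_exp(a):
    return np.logaddexp.reduce(a) - np.log(len(a))

def prior_bound(K):                # f(x, h) = P(x | h), with h ~ P(h)
    hs = rng.binomial(1, p_h, size=K)
    return log_mean_exp(np.log([p_x1[int(h)] for h in hs]))

def proposal_bound(K):             # f(x, h) = P(x, h) / Q(h | x), with h ~ Q(h | x)
    hs = rng.binomial(1, q_h, size=K)
    lw = [np.log(p_h if h else 1 - p_h) + np.log(p_x1[int(h)])
          - np.log(q_h if h else 1 - q_h) for h in hs]
    return log_mean_exp(np.array(lw))

K = 5
print("log P(x)        :", round(log_px, 3))
print("prior sampling  :", round(np.mean([prior_bound(K) for _ in range(5000)]), 3))
print("learned proposal:", round(np.mean([proposal_bound(K) for _ in range(5000)]), 3))
```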

Gradients of the lower bound
We would like to maximize the objective
$$\mathcal{L}^K(x) = E_{Q(h^{1:K}|x)}\!\left[\log \frac{1}{K}\sum_{k=1}^{K} f(x,h^k)\right] = E_{Q(h^{1:K}|x)}\big[\log \hat{I}(h^{1:K})\big].$$
Its gradient can be expressed as
$$\nabla_\theta \mathcal{L}^K(x) = E_{Q(h^{1:K}|x)}\!\left[\sum_j \log \hat{I}(h^{1:K})\,\nabla_\theta \log Q(h^j|x)\right] + E_{Q(h^{1:K}|x)}\!\left[\sum_j \tilde{w}^j\,\nabla_\theta \log f(x,h^j)\right],$$
where $\tilde{w}^j \equiv \frac{f(x,h^j)}{\sum_{k=1}^{K} f(x,h^k)}$.
The second term is easy to estimate. The first term is much harder. 7 / 18
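As a small sketch of the quantities appearing in this gradient (the log f values below are arbitrary placeholders, not real model outputs): the shared learning signal log Î that multiplies each grad log Q(h^j|x) in the first term, and the normalized weights w̃^j from the second term.

```python
# Sketch of the per-sample quantities in the gradient above, for one draw of K samples.
# log_f holds placeholder values of log f(x, h^k); in a real model these would come
# from the generative network, and gradients would be taken by an autodiff library.
import numpy as np

log_f = np.array([-3.1, -2.4, -5.0, -2.9])              # log f(x, h^k), k = 1..K
K = len(log_f)

log_I_hat = np.logaddexp.reduce(log_f) - np.log(K)      # log (1/K) sum_k f(x, h^k)
w_tilde = np.exp(log_f - np.logaddexp.reduce(log_f))    # w~^j = f(x,h^j) / sum_k f(x,h^k)

# Single-sample gradient estimate (schematically):
#   sum_j log_I_hat * grad_theta log Q(h^j|x)  +  sum_j w_tilde[j] * grad_theta log f(x, h^j).
# The first (score-function) term is the high-variance one addressed on the next slides.
print(log_I_hat)
print(w_tilde, w_tilde.sum())                           # the weights sum to 1
```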

Estimating the gradients (NVIL-style)
Can use the Neural Variational Inference and Learning (NVIL) estimator developed for the single-sample variational objective. Applying it to the multi-sample objective gives
$$\nabla_\theta \mathcal{L}^K(x) \approx \sum_j \big(\log \hat{I}(h^{1:K}) - b(x)\big)\,\nabla_\theta \log Q(h^j|x) + \sum_j \tilde{w}^j\,\nabla_\theta \log f(x,h^j), \quad \text{with } h^k \sim Q(h|x),$$
where $b(x)$ is a predictor/baseline trained to predict $\log \hat{I}(h^{1:K})$.
Drawback: uses the same learning signal for all $h^k$, even though some samples will be much better than others. There is no credit assignment within a set of $K$ samples. 8 / 18
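A sketch of the NVIL-style learning signal for the multi-sample case, continuing the placeholder numbers from the previous sketch; b(x) is shown as a fixed stand-in for the trained baseline. The point is that every one of the K samples receives exactly the same centered signal.

```python
# NVIL-style signal for the multi-sample objective (placeholder numbers, not real outputs):
# the same centered scalar (log I_hat - b(x)) multiplies grad_theta log Q(h^j|x) for every j.
import numpy as np

log_f = np.array([-3.1, -2.4, -5.0, -2.9])              # log f(x, h^k), placeholders
K = len(log_f)
log_I_hat = np.logaddexp.reduce(log_f) - np.log(K)

b_x = -3.0                                              # stand-in for the learned baseline b(x)
signal = log_I_hat - b_x                                # one scalar shared by all K samples
print(np.full(K, signal))                               # identical entries: no per-sample credit assignment
```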

Disentangling the learning signals
Would like to have a different learning signal for each sample.
Key observation: since the $K$ samples are independent, when considering the learning signal for one of the samples, we can treat all the other samples as constant. Can subtract out the effect of the other samples to isolate the effect of the sample of interest.
What to subtract from $\log \hat{I}(h^{1:K})$? Something very close to it that does not depend on the sample of interest.
One idea: train a baseline-like predictor for $f(x,h^j)$. This introduces additional complexity. Can we avoid learning an extra mapping? 9 / 18

VIMCO: simple local learning signals
Better idea: estimate $f(x,h^j)$ from the other $K-1$ values $f(x,h^k)$, $k \neq j$. Two natural choices for estimating $f(x,h^j)$ are:
the arithmetic mean: $\hat{f}(x,h^j) = \frac{1}{K-1}\sum_{k \neq j} f(x,h^k)$
the geometric mean: $\hat{f}(x,h^j) = \exp\!\big(\frac{1}{K-1}\sum_{k \neq j} \log f(x,h^k)\big)$
This gives the following local learning signals:
$$\hat{L}(h^j \mid h^{-j}) = \log \frac{1}{K}\sum_{k=1}^{K} f(x,h^k) - \log \frac{1}{K}\Big(\sum_{k \neq j} f(x,h^k) + \hat{f}(x,h^j)\Big).$$
The second term can be seen as a hand-crafted sample-dependent baseline with no free parameters. VIMCO local learning signals work well without any additional variance reduction. 10 / 18
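A sketch of the VIMCO local learning signals using the geometric-mean estimate (the same placeholder log f values as in the earlier sketches, not real model outputs): for each j, the j-th term of the average is swapped for the geometric mean of the other K-1 values, and the difference of the two log-averages is the per-sample signal.

```python
# VIMCO local learning signals with the geometric-mean estimate f_hat(x, h^j)
# (placeholder log f values; a real implementation would get them from the model).
import numpy as np

log_f = np.array([-3.1, -2.4, -5.0, -2.9])              # log f(x, h^k), placeholders
K = len(log_f)

def log_mean_exp(a):
    return np.logaddexp.reduce(a) - np.log(len(a))

log_I_hat = log_mean_exp(log_f)                         # log (1/K) sum_k f(x, h^k)

signals = np.empty(K)
for j in range(K):
    log_f_hat_j = np.delete(log_f, j).mean()            # geometric mean of the others, in log space
    swapped = log_f.copy()
    swapped[j] = log_f_hat_j                            # replace sample j by its estimate
    signals[j] = log_I_hat - log_mean_exp(swapped)      # L_hat(h^j | h^{-j})

# Each grad_theta log Q(h^j|x) gets its own signal; the w~^j-weighted term is unchanged.
print(signals)                                          # differs across samples, no learned baseline needed
```

Good samples (large log f) receive a positive signal and poor ones a negative signal, which is the built-in credit assignment the slide refers to.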

Variance reduction: VIMCO vs. NVIL
The magnitude (root mean square) of the learning signal for VIMCO and NVIL as a function of the number of samples used in the objective and the number of parameter updates.
[Figure: RMS of the learning signal vs. training steps for VIMCO(2), VIMCO(5), VIMCO(10) and NVIL(1), NVIL(2), NVIL(5), NVIL(10).] 11 / 18

Generative modelling: 200-200-200 SBN on MNIST
Estimates of the negative log-likelihood (in nats) for generative modelling on MNIST. The model is an SBN with three latent layers of 200 binary units.

Number of training samples   VIMCO   NVIL   RWS
1                              -     95.2    -
2                             93.5   93.6   94.6
5                             92.8   93.7   93.4
10                            92.6   93.4   93.0
50                            91.9   96.2   92.5

12 / 18

Structured prediction with inference (MNIST)
[Figure: bound (nats) vs. training steps for VIMCO(2), VIMCO(5), VIMCO(20), VIMCO(50) and NVIL(1), NVIL(2), NVIL(5), NVIL(20), NVIL(50).] 13 / 18

Structured prediction without inference (MNIST)
[Figure: bound (nats) vs. training steps.] 14 / 18

Conclusions
A principled, unbiased approach to optimizing multi-sample variational objectives that just works. Implements effective variance reduction essentially for free. Does not need a learned baseline to work well. Performs better than NVIL and Reweighted Wake-Sleep. Can handle both discrete and continuous latent variables and can be combined with the reparameterization trick. 15 / 18

Thank you! 16 / 18

Generative modelling: 200-200-200 SBN on MNIST
[Figure: bound (nats) vs. training steps for VIMCO(2), VIMCO(5), VIMCO(10), VIMCO(50) and NVIL(1), NVIL(2), NVIL(5), NVIL(10), NVIL(50).] 17 / 18

Structured prediction: completion examples 18 / 18