Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes. Diederik P Kingma, Max Welling. June 18, 2018.

Outline
1. Introduction
2. Variational Lower Bound
3. SGVB Estimators
4. AEVB algorithm
5. Variational Auto-Encoder
6. Experiments and results
7. VAE Examples
8. References

Introduction: Motivation
Deep learning with autoencoders has shown great success at feature extraction. We want to scale variational inference to large datasets. Approximating the intractable posterior can be used for multiple tasks:
- Recognition
- Denoising
- Representation
- Visualization
- Generative modeling

Introduction: Motivation, Intuition
How do we move from a sample x^(i) to the latent space z^(i), and reconstruct x^(i)?

Introduction: Problem scenario
Intractability: the marginal likelihood $p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$ is intractable (we cannot evaluate or differentiate it), the true posterior density $p_\theta(z \mid x) = p_\theta(x \mid z)\, p_\theta(z) / p_\theta(x)$ is intractable (so the EM algorithm cannot be used), and the integrals required for any reasonable mean-field VB algorithm are also intractable.
A large dataset: we have so much data that batch optimization is too costly; we would like to make parameter updates using small minibatches or even single datapoints. Sampling-based solutions, e.g. Monte Carlo EM, would in general be too slow, since they involve a typically expensive sampling loop per datapoint.

Variational Lower Bound: Bayesian inference
$\theta$: distribution parameters.
$\alpha$: hyperparameters of the distribution over the parameters, i.e. our prior $\theta \sim p(\theta \mid \alpha)$.
$X$: samples.
Marginal likelihood: $p(x \mid \alpha) = \int p(x \mid \theta)\, p(\theta \mid \alpha)\, d\theta$.
Posterior distribution: $p(\theta \mid X, \alpha) \propto p(X \mid \theta)\, p(\theta \mid \alpha)$.
MAP (maximum a posteriori): $\hat\theta_{\mathrm{MAP}}(X) = \arg\max_\theta\, p(X \mid \theta)\, p(\theta \mid \alpha)$.
A sample $x$ is generated by first sampling $\theta \sim p(\theta \mid \alpha)$ and then sampling $x \sim p(x \mid \theta)$.
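
As a concrete, hypothetical instance of these definitions (not from the slides): a minimal Python sketch with a Bernoulli likelihood and a Beta prior, where the MAP estimate maximizes the unnormalized posterior. The hyperparameters and data below are illustrative.

```python
import numpy as np

# Bernoulli likelihood p(x | theta) with a Beta(a, b) prior p(theta | alpha),
# where alpha = (a, b). Illustrative values only.
a, b = 2.0, 2.0                      # hyperparameters alpha
X = np.array([1, 0, 1, 1, 0, 1])     # observed samples

def log_posterior(theta):
    # log p(X | theta) + log p(theta | alpha), up to an additive constant
    loglik = np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))
    logprior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    return loglik + logprior

# MAP estimate by grid search; closed form is (sum(X) + a - 1) / (len(X) + a + b - 2).
grid = np.linspace(1e-3, 1 - 1e-3, 1001)
theta_map = grid[np.argmax([log_posterior(t) for t in grid])]
print(theta_map)   # approximately 0.625
```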

Variational Lower Bound: Bayesian inference, Intuition
Let $z \sim p(z \mid \alpha)$ be an animal generator, with
$p(z \mid \alpha) = 0.3$ for CatGenerator, $0.2$ for DogGenerator, $0.5$ for ParrotGenerator.
Given our sample $x$ (an image on the original slide), what are the chances that $z$ is the cat generator?
$p(z = \mathrm{CG} \mid x, \alpha) = \dfrac{p(x \mid z = \mathrm{CG})\, p(z = \mathrm{CG} \mid \alpha)}{\sum_{\mathrm{Gen}} p(x \mid z = \mathrm{Gen})\, p(z = \mathrm{Gen} \mid \alpha)}$
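
A minimal sketch of this computation; only the prior probabilities come from the slide, and the likelihood values p(x | z) below are made up for illustration.

```python
prior = {"CatGenerator": 0.3, "DogGenerator": 0.2, "ParrotGenerator": 0.5}
likelihood = {"CatGenerator": 0.8, "DogGenerator": 0.15, "ParrotGenerator": 0.05}

evidence = sum(likelihood[g] * prior[g] for g in prior)          # p(x | alpha)
posterior = {g: likelihood[g] * prior[g] / evidence for g in prior}
print(posterior["CatGenerator"])  # chance that the cat generator produced x
```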

Variational Lower Bound: Bayesian inference, Problems
Directly solving a Bayesian inference problem very often requires evaluating intractable integrals. There are two main approaches to overcome this:
- Sampling: mostly MCMC methods, such as Gibbs sampling, are used to find the optimal parameters.
- Variational methods, such as the mean-field approximation.

Variational Lower Bound
Instead of evaluating an intractable distribution P, we find a simpler distribution Q and use it in place of P where needed.

Variational Lower Bound (1)
How can we choose our Q?
- Pick some tractable Q which can explain our data well.
- Optimize its parameters, using P and our given data.
For example, we could pick $Q = \mathcal{N}(\mu, \sigma^2)$ and optimize its parameters $\mu, \sigma$.

Variational Lower Bound (2)
Looking at the log probability of the observations $X$:
$\log p(X) = \log \sum_Z p(X, Z) = \log \sum_Z q(Z)\, \frac{p(X, Z)}{q(Z)} = \log \mathbb{E}_q\!\left[\frac{p(X, Z)}{q(Z)}\right]$
By Jensen's inequality:
$\log p(X) \ge \mathbb{E}_q\!\left[\log \frac{p(X, Z)}{q(Z)}\right] = \mathbb{E}_q[\log p(X, Z)] - \mathbb{E}_q[\log q(Z)] = \mathbb{E}_q[\log p(X, Z)] + H(Z)$
The right-hand side $\mathcal{L} = \mathbb{E}_q[\log p(X, Z)] + H(Z)$ is the variational lower bound, where $H$ is the entropy of $q$.
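
A small numeric sanity check of the Jensen bound, using a toy two-state latent variable model; the model and the choice of q are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy model: z in {0, 1} with p(z) = 0.5 and p(x | z) = N(x; mu_z, 1).
x = 0.7
mus = np.array([-1.0, 2.0])
p_xz = 0.5 * np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)   # p(x, z) for z = 0, 1
log_px = np.log(p_xz.sum())                                       # exact log p(x)

q = np.array([0.6, 0.4])                                          # a tractable q(z)
elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))           # E_q[log p(x, z)] + H(q)

print(log_px, elbo)   # the bound L = elbo never exceeds log p(x)
```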

Variational Lower Bound: Kullback-Leibler divergence
The KL divergence is a measure of how one probability distribution diverges from another.
Discrete case: $D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
Continuous case: $D_{KL}(P \,\|\, Q) = \int P(x) \log \frac{P(x)}{Q(x)}\, dx$
When $P(x) = Q(x)$ for all $x \in \mathcal{X}$, $D_{KL}(P \,\|\, Q) = 0$.
$D_{KL}(P \,\|\, Q)$ is always non-negative.
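
A minimal sketch of the discrete KL divergence formula above, with illustrative distributions.

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), assuming strictly positive entries
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = [0.3, 0.2, 0.5]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # positive
print(kl_divergence(p, p))  # 0.0, since P == Q
```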

Variational Lower Bound (3)
$D_{KL}(q(Z) \,\|\, p(Z \mid X)) = \int q(Z) \log \frac{q(Z)}{p(Z \mid X)}\, dZ = \int q(Z) \log \frac{q(Z)\, p(X)}{p(X, Z)}\, dZ$
$\qquad = \int q(Z) \log \frac{q(Z)}{p(X, Z)}\, dZ + \log p(X) \underbrace{\int q(Z)\, dZ}_{=1} = -\mathcal{L} + \log p(X)$
Hence $\log p(X) = \mathcal{L} + D_{KL}(q(Z) \,\|\, p(Z \mid X))$.
As the KL divergence is always non-negative, we get $\log p(X) \ge \mathcal{L}$, and either maximizing $\mathcal{L}$ or minimizing $D_{KL}(q(Z) \,\|\, p(Z \mid X))$ will optimize $q$.

SGVB Estimators: Stochastic search variational Bayes
Let $\psi$ denote the variational parameters of the distribution $q$. We want to optimize $\mathcal{L}$ w.r.t. both $\theta$ (the generative parameters) and $\psi$.
Separate $\mathcal{L}$ into $\mathbb{E}_q[f(z)]$ and $h$, where $h(x, \psi)$ contains everything in $\mathcal{L}$ except for $\mathbb{E}_q[f(z)]$:
$\nabla_\psi \mathcal{L} = \nabla_\psi \mathbb{E}_q[f(z)] + \nabla_\psi h(x, \psi)$
$\nabla_\psi h(x, \psi)$ is tractable, while the intractable $\nabla_\psi \mathbb{E}_q[f(z)]$ is approximated by Monte Carlo integration using the score-function identity:
$\nabla_\psi \mathbb{E}_q[f(z)] \approx \frac{1}{S} \sum_{s=1}^{S} f(z^{(s)})\, \nabla_\psi \ln q(z^{(s)} \mid \psi), \qquad z^{(s)} \overset{iid}{\sim} q(z \mid \psi),\ s = 1, \dots, S$
Denoting this approximation by $\zeta$, the gradient step is
$\psi^{(t+1)} = \psi^{(t)} + \rho_t\, \nabla_\psi h(x, \psi^{(t)}) + \rho_t\, \zeta_t$
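
A minimal sketch of this score-function (REINFORCE-style) estimator for a univariate Gaussian q and an arbitrary f; the particular q and f are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2

def grad_estimate(mu, S=10_000):
    z = rng.normal(mu, 1.0, size=S)           # z^(s) ~ q(z | psi) = N(mu, 1)
    score = z - mu                            # d/dmu log N(z; mu, 1)
    return np.mean(f(z) * score)              # (1/S) * sum_s f(z^(s)) * score

# True gradient of E_q[z^2] = mu^2 + 1 w.r.t. mu is 2 * mu.
print(grad_estimate(1.5))  # noisy estimate near 3.0, illustrating the high variance
```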

SGVB Estimators: SGVB estimator
The above method exhibits very high variance, may converge very slowly, and is impractical for our purpose. Instead, we reparameterize the random variable $z \sim q_\psi(z \mid x)$ using a differentiable transformation $g_\psi(\epsilon, x)$ of an auxiliary noise variable $\epsilon$:
$z = g_\psi(\epsilon, x), \qquad \epsilon \sim p(\epsilon)$
We can now form a Monte Carlo estimate of the expectation of a function $f(z)$ w.r.t. $q_\psi(z \mid x)$:
$\mathbb{E}_{q_\psi(z \mid x^{(i)})}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[f(g_\psi(\epsilon, x^{(i)}))\right] \approx \frac{1}{S} \sum_{s=1}^{S} f(g_\psi(\epsilon^{(s)}, x^{(i)})), \qquad \epsilon^{(s)} \sim p(\epsilon)$
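
A minimal sketch of the reparameterization trick for a Gaussian $q_\psi(z \mid x) = \mathcal{N}(\mu, \sigma^2)$, with illustrative values for $\mu$ and $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2

mu, sigma, S = 1.5, 0.5, 10_000
eps = rng.standard_normal(S)                  # eps^(s) ~ p(eps) = N(0, 1)
z = mu + sigma * eps                          # z = g_psi(eps, x)
print(np.mean(f(z)))                          # estimates E_q[f(z)] = mu^2 + sigma^2

# Because z is a deterministic, differentiable function of (mu, sigma),
# gradients of this estimate w.r.t. mu and sigma can flow through the samples.
```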

SGVB Estimators: Estimator A
Applying the above technique to the variational lower bound $\mathcal{L}$ yields our generic SGVB estimator $\tilde{\mathcal{L}}^A(\theta, \psi; x^{(i)}) \simeq \mathcal{L}(\theta, \psi; x^{(i)})$:
$\tilde{\mathcal{L}}^A(\theta, \psi; x^{(i)}) = \frac{1}{S} \sum_{s=1}^{S} \left[\log p_\theta(x^{(i)}, z^{(i,s)}) - \log q_\psi(z^{(i,s)} \mid x^{(i)})\right]$
where $z^{(i,s)} = g_\psi(\epsilon^{(i,s)}, x^{(i)})$ and $\epsilon^{(s)} \sim p(\epsilon)$.

SGVB Estimators: Estimator B
Alternatively, we can apply the above technique only to the reconstruction term and keep the KL divergence term, obtaining a second version of the SGVB estimator:
$\tilde{\mathcal{L}}^B(\theta, \psi; x^{(i)}) = -D_{KL}(q_\psi(z \mid x^{(i)}) \,\|\, p_\theta(z)) + \frac{1}{S} \sum_{s=1}^{S} \log p_\theta(x^{(i)} \mid z^{(i,s)})$
where $z^{(i,s)} = g_\psi(\epsilon^{(i,s)}, x^{(i)})$ and $\epsilon^{(s)} \sim p(\epsilon)$.
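
A minimal NumPy sketch of estimator B for a diagonal-Gaussian q and a standard normal prior; the decoder log-likelihood below is a placeholder assumption, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_B(mu, log_var, log_px_given_z, S=16):
    sigma = np.exp(0.5 * log_var)
    # Analytic -D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.
    neg_kl = 0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    # Monte Carlo reconstruction term, sampled with the reparameterization trick.
    eps = rng.standard_normal((S, mu.size))        # eps^(s) ~ N(0, I)
    z = mu + sigma * eps                           # z^(i,s) = g_psi(eps, x)
    recon = np.mean([log_px_given_z(z_s) for z_s in z])
    return neg_kl + recon

def log_px(z):
    # Placeholder for the decoder log-likelihood log p_theta(x | z); not a real model.
    return -0.5 * np.sum(z ** 2)

print(estimator_B(np.zeros(2), np.zeros(2), log_px))
```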

SGVB Estimators: Reparameterization trick, intuition
By adding the noise variable we reduce the variance of the estimator, paying for it with some accuracy. (Illustrated with a figure on the original slide.)

AEVB algorithm: Auto-Encoding Variational Bayes
Given a dataset $X$ with $N$ datapoints, we can construct an estimator of the marginal-likelihood lower bound of the full dataset based on minibatches:
$\mathcal{L}(\theta, \psi; X) \simeq \tilde{\mathcal{L}}^M(\theta, \psi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \psi; x^{(i)})$
where the minibatch $X^M$ consists of $M$ randomly drawn datapoints. Both estimators A and B can be used.

AEVB algorithm: Minibatch AEVB algorithm
1. $\theta, \psi \leftarrow$ Initialize parameters
2. Repeat:
   a. $X^M \leftarrow$ Random minibatch of $M$ datapoints (drawn from the full dataset)
   b. $\epsilon \leftarrow$ Random samples from the noise distribution $p(\epsilon)$
   c. $g \leftarrow \nabla_{\theta,\psi}\, \tilde{\mathcal{L}}^M(\theta, \psi; X^M, \epsilon)$ (gradients of the minibatch estimator)
   d. $\theta, \psi \leftarrow$ Update parameters using gradients $g$ (e.g. SGD or Adagrad)
3. Until convergence of parameters $(\theta, \psi)$
4. Return $\theta, \psi$
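
A minimal PyTorch sketch of this loop, using estimator B with a Bernoulli decoder and a single noise sample per datapoint; the architecture, random stand-in data, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                       # eps ~ p(eps)
        z = mu + torch.exp(0.5 * logvar) * eps           # reparameterization trick
        logits = self.dec(z)                             # parameters of p_theta(x | z)
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term -log p_theta(x | z) for a Bernoulli decoder (S = 1)
    recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Analytic D_KL( q_psi(z | x) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl                                    # minimize the negative of L^B

model = VAE()
opt = torch.optim.Adagrad(model.parameters(), lr=1e-2)
data = torch.rand(1000, 784).round()                     # stand-in binary dataset

for step in range(100):                                  # until convergence in practice
    idx = torch.randint(0, data.shape[0], (100,))        # random minibatch of M = 100
    x = data[idx]
    loss = neg_elbo(x, *model(x))
    opt.zero_grad()
    loss.backward()
    opt.step()
```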

Variational Auto-Encoder
Variational autoencoders (VAEs) were introduced in 2013 by Kingma et al. (this paper) and, simultaneously, by Rezende et al. (Google). A variational autoencoder consists of an encoder, a decoder, and a loss function.

Variational Auto-Encoder: Auto-Encoder (diagram on the original slide).

Variational Auto-Encoder: Variational Auto-Encoder (diagram on the original slide).

Variational Auto-Encoder: the encoder
The encoder is a neural network. Its input is a datapoint $x$, its output is a latent (hidden) representation $z$, and it has weights and biases $\psi$. The encoder maps the $N$-dimensional data into a latent representation space of much lower dimensionality. This lower-dimensional space is stochastic: the encoder outputs the parameters of $q_\psi(z \mid x)$.

Variational Auto-Encoder: the decoder
The decoder is another neural network. Its input is the representation $z$, its output is the parameters of the probability distribution of the data, and it has weights and biases $\theta$. The decoder outputs the parameters of $p_\theta(x \mid z)$, decoding the real-valued numbers in $z$ back into $N$ real-valued numbers. The decoder loses information, since it maps from a lower- to a higher-dimensional space. How much information is lost is measured by the reconstruction log-likelihood $\log p_\theta(x \mid z)$.

Experiments and results: Experiments
Use a neural network for the encoder $q_\psi(z \mid x)$ (the approximation of the intractable posterior $p_\theta(z \mid x)$), and optimize $\psi, \theta$ with the AEVB algorithm.
- Assume $p_\theta(x \mid z)$ is a multivariate Gaussian or Bernoulli whose parameters are computed from $z$ with an MLP.
- $p_\theta(z \mid x)$ is intractable.
- Let $q_\psi(z \mid x)$ be a Gaussian with diagonal covariance: $\log q_\psi(z \mid x^{(i)}) = \log \mathcal{N}(z;\, \mu^{(i)}, \sigma^{2(i)} I)$, where $\mu^{(i)}$ and $\sigma^{(i)}$ are outputs of the encoding MLP, i.e. nonlinear functions of $x^{(i)}$.
- Maximize the objective:
$\mathcal{L}(\theta, \psi; x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$
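
As a quick check (not from the slides), the first sum in this objective is the analytic negative KL divergence between the diagonal Gaussian $q_\psi(z \mid x^{(i)})$ and the standard normal prior $p_\theta(z) = \mathcal{N}(0, I)$:

```latex
\begin{aligned}
-D_{KL}\!\left(q_\psi(z \mid x^{(i)}) \,\middle\|\, \mathcal{N}(0, I)\right)
  &= \int q_\psi(z \mid x^{(i)}) \left[\log \mathcal{N}(z; 0, I) - \log q_\psi(z \mid x^{(i)})\right] dz \\
  &= \sum_{j=1}^{J} \left[-\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\!\left((\mu_j^{(i)})^2 + (\sigma_j^{(i)})^2\right)
       + \tfrac{1}{2}\log 2\pi + \tfrac{1}{2}\!\left(1 + \log (\sigma_j^{(i)})^2\right)\right] \\
  &= \tfrac{1}{2} \sum_{j=1}^{J} \left(1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right),
\end{aligned}
```

which matches the first term of the objective above, so estimator B applies directly.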

Experiments and results: Experiments (figures on the original slide).

Experiments and results: Experiments
Trained generative models of images from the MNIST and Frey Face datasets and compared learning algorithms in terms of the variational lower bound and the estimated marginal likelihood.
- The encoder and decoder have an equal number of hidden units.
- For the Frey Face data, the decoder had Gaussian outputs, identical to the encoder, except that the means were constrained to the interval (0, 1) using a sigmoid activation at the decoder output.
- The hidden units are the hidden layers of the encoder and decoder neural networks.
- The performance of AEVB was compared to the wake-sleep algorithm.

Experiments and results: Experiments
Likelihood lower bound: trained generative models (decoders) and corresponding encoders with 500 hidden units for MNIST and 200 hidden units for the Frey Face data.
Marginal likelihood: for a very low-dimensional latent space it is possible to estimate the marginal likelihood of the learned generative models using an MCMC estimator. Here the encoder and decoder were neural networks with 100 hidden units and 3 latent variables.

Experiments and results: Experiments
Comparison of the AEVB method to the wake-sleep algorithm in terms of optimizing the lower bound, for different dimensionalities of the latent space. The vertical axis is the estimated average variational lower bound per datapoint; the horizontal axis is the number of training points evaluated.

Experiments and results: Experiments
Comparison of AEVB to the wake-sleep algorithm and Monte Carlo EM, in terms of the estimated marginal likelihood, for different numbers of training points. Monte Carlo EM is not an online algorithm and (unlike AEVB and the wake-sleep method) cannot be applied efficiently to the full MNIST dataset.

Experiments and results: Experiments
Learned MNIST manifold, on a 2D latent space.

VAE Examples: SketchRNN
Schematic of sketch-rnn.

VAE Examples: SketchRNN
Latent space interpolations generated from a model trained on pig sketches.

VAE Examples: SketchRNN
Learned relationships between abstract concepts, explored using latent vector arithmetic.

VAE Examples: MusicVAE
Schematic of MusicVAE.

VAE Examples: MusicVAE
Beat blender.

References I
- MusicVAE. https://magenta.tensorflow.org/music-vae
- Teaching machines to draw. https://ai.googleblog.com/2017/04/teaching-machines-to-draw.html
- Variational autoencoders. https://www.jeremyjordan.me/variational-autoencoders/
- What is a variational autoencoder? Tutorial. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158-1161, 1995.

References II
- J. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. ICML, 2012.
- M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014.
- S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.