Variational Autoencoders (VAEs)

September 26 & October 3, 2017

Section 1 Preliminaries

Kullback-Leibler divergence (continuous case). Let p(x) and q(x) be two probability densities. The KL divergence is defined as
\[ \mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. \quad (1.1) \]
By Jensen's inequality, KL(p ‖ q) ≥ 0, and equality holds if and only if p = q almost everywhere.

Kullback-Leibler divergence, special case: multivariate Gaussians. Suppose the k-dimensional variables p_1 ~ N(µ_1, Σ_1) and p_2 ~ N(µ_2, Σ_2). Then
\[ \mathrm{KL}(p_1 \,\|\, p_2) = \frac{1}{2} \left[ \log \frac{\det \Sigma_2}{\det \Sigma_1} - k + \operatorname{tr}\!\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right]. \]
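As a quick numerical check of this closed form, here is a minimal NumPy sketch (the function name gaussian_kl and the example parameters are illustrative, not from the slides):

    import numpy as np

    def gaussian_kl(mu1, cov1, mu2, cov2):
        """KL( N(mu1, cov1) || N(mu2, cov2) ), using the closed form above."""
        k = mu1.shape[0]
        cov2_inv = np.linalg.inv(cov2)
        diff = mu2 - mu1
        return 0.5 * (np.log(np.linalg.det(cov2) / np.linalg.det(cov1)) - k
                      + np.trace(cov2_inv @ cov1)
                      + diff @ cov2_inv @ diff)

    # Sanity checks: zero for identical Gaussians, positive otherwise.
    mu, cov = np.zeros(2), np.eye(2)
    print(gaussian_kl(mu, cov, mu, cov))                      # 0.0
    print(gaussian_kl(mu, cov, np.ones(2), 2.0 * np.eye(2)))  # ~0.69 > 0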

Variational Inference. Suppose we want to use Q(z) to approximate P(z | X), where P(z | X) has no explicit representation. A good approximation should try to minimize
\[ \mathrm{KL}(Q(z) \,\|\, P(z \mid X)) = \int Q(z) \log \frac{Q(z)}{P(z \mid X)} \, dz. \]
By Bayes' rule, this can be rearranged into
\[ \log P(X) - \mathrm{KL}(Q(z) \,\|\, P(z \mid X)) = \int Q(z) \log P(X \mid z) \, dz - \mathrm{KL}(Q(z) \,\|\, P(z)). \quad (1.2) \]
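For completeness, a short derivation of Eq. (1.2), using only the definition of the KL divergence and Bayes' rule P(z | X) = P(X | z) P(z) / P(X):

\begin{align*}
\mathrm{KL}(Q(z) \,\|\, P(z \mid X))
  &= \mathbb{E}_{Q(z)}\big[\log Q(z) - \log P(z \mid X)\big] \\
  &= \mathbb{E}_{Q(z)}\big[\log Q(z) - \log P(X \mid z) - \log P(z)\big] + \log P(X) \\
  &= \mathrm{KL}(Q(z) \,\|\, P(z)) - \mathbb{E}_{Q(z)}[\log P(X \mid z)] + \log P(X).
\end{align*}

Rearranging the last line gives Eq. (1.2).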

Section 2 Variational Autoencoders¹
¹ A. B. L. Larsen et al. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

Original problem. Given a dataset X drawn from a distribution P(x), we want to generate new data that follows the unknown distribution P(x). We construct a model f(z; θ): Z → X, where X is the space of observed variables (data), Z the space of latent variables, Θ the parameter space, and f a complex but deterministic mapping. Latent variables: variables that are not directly observed but are instead inferred from the directly observed variables. Given z, we can generate a sample X by f(z; θ). We wish to optimize θ so that we can sample z from P(z) and, with high probability, f(z; θ) will be like the X's in our dataset.

Likelihood.
\[ P(X; \theta) = \int P(X \mid z; \theta) \, P(z) \, dz. \quad (2.1) \]
Choose θ to maximize this integral. In VAEs, P(X | z; θ) = N(X; f(z; θ), σ²I) in the continuous case and P(X | z; θ) = B(f(z; θ)), a Bernoulli distribution, in the discrete case. In both cases P(X | z; θ) is continuous with respect to θ, so we can use gradient ascent to maximize it. Questions: How do we define the latent variable z so that it captures the latent information? How do we deal with the integral over z, and with its gradient with respect to θ?

Defining the latent variable. We want the latent variable to satisfy two properties: the latent variables are chosen automatically, because we do not know much about the intrinsic properties of X; and the different components of z are mutually independent, to avoid overlap in the latent information. VAEs assert that the latent variable can simply be drawn from a standard Gaussian distribution, N(0, I). Assertion: any distribution in d dimensions can be generated by taking a set of d normally distributed variables and mapping them through a sufficiently complicated function. Since f(z; θ) is complicated enough (it is a trained neural network), this choice of latent variable does not matter too much. A toy illustration follows.
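To illustrate the assertion (my own example, not from the slides): pushing 2-D standard normal samples through a simple nonlinear map already yields a clearly non-Gaussian, ring-shaped distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal((10_000, 2))      # z ~ N(0, I)

    # g plays the role of f(z; theta): it pushes the Gaussian cloud onto a ring.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    x = z / 10 + z / norms

    r = np.linalg.norm(x, axis=1)
    print(r.mean(), r.std())                  # radius ~1.13 with spread ~0.07: a thin ring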

Dealing with the integral. A naive approach is plain Monte Carlo:
\[ P(X; \theta) \approx \frac{1}{n} \sum_i P(X \mid z^{(i)}; \theta), \qquad z^{(i)} \sim N(0, I). \]
Figure: counterexample. We would need to set σ very small, which in turn requires a very large number of samples. In this case, we need a faster sampling procedure for z.
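The following NumPy sketch shows why the naive estimator is so wasteful (the toy decoder, data point, and threshold are my own illustrative choices): almost no sample drawn from the prior lands anywhere near the observed X.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 0.1
    x = np.array([3.0, 3.0])                  # one observed data point

    # Toy decoder f(z; theta) = z, so P(x | z) = N(x; z, sigma^2 I).
    def log_p_x_given_z(x, z, sigma):
        d = x.shape[0]
        return (-0.5 * np.sum((x - z) ** 2, axis=1) / sigma**2
                - d * np.log(sigma * np.sqrt(2 * np.pi)))

    z = rng.standard_normal((100_000, 2))     # z^(i) ~ N(0, I), drawn from the prior
    log_w = log_p_x_given_z(x, z, sigma)
    # Fraction of prior samples whose P(x | z) is not vanishingly small:
    print(np.mean(log_w > -20))               # tiny: nearly every sample is wasted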

Dealing with the integral: sampling in VAEs. The key idea behind the variational autoencoder is to attempt to sample values of z that are likely to have produced X, and compute P(X) just from those. We introduce a new function Q(z) that gives us a distribution over z values likely to produce X, and replace E_{P(z)}[P(X | z)] by E_{Q(z)}[P(X | z)]. The optimum choice of Q(z) would be P(z | X), but P(z | X) is intractable. Aim: find a Q(z) that approximates P(z | X) while being simple enough.

Recall: variational inference. For any Q(z), use Q(z) to approximate P(z | X). According to Equation (1.2),
\[ \log P(X) - \mathrm{KL}(Q(z) \,\|\, P(z \mid X)) = \mathbb{E}_{Q(z)}[\log P(X \mid z)] - \mathrm{KL}(Q(z) \,\|\, P(z)). \]
Since we are interested in inferring P(X), it makes sense to construct a Q that does depend on X:
\[ \log P(X) - \mathrm{KL}(Q(z \mid X) \,\|\, P(z \mid X)) = \mathbb{E}_{Q(z \mid X)}[\log P(X \mid z)] - \mathrm{KL}(Q(z \mid X) \,\|\, P(z)). \quad (2.2) \]
Aim: maximize log P(X) (w.r.t. θ) while minimizing KL(Q(z | X) ‖ P(z | X)); that is, maximize the LHS, which is the same as maximizing the RHS.

Second term of the RHS. Aim: minimize KL(Q(z | X) ‖ P(z)). We already have P(z) = N(0, I). The usual choice is Q(z | X) = N(µ(X), Σ(X)), where µ and Σ are deterministic functions of X with their own parameters (omitted in the following equations), and Σ is constrained to be a diagonal matrix. By the earlier formula for the KL divergence between multivariate Gaussians,
\[ \mathrm{KL}(Q(z \mid X) \,\|\, P(z)) = \frac{1}{2} \left( \operatorname{tr} \Sigma(X) + \mu(X)^\top \mu(X) - k - \log \det \Sigma(X) \right). \]
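With a diagonal Σ(X) this term is cheap to evaluate. A minimal NumPy sketch (the names mu and sigma2 are illustrative; sigma2 holds the diagonal of Σ(X)):

    import numpy as np

    def kl_q_p(mu, sigma2):
        """KL( N(mu, diag(sigma2)) || N(0, I) ), per the formula above."""
        k = mu.shape[0]
        return 0.5 * (np.sum(sigma2) + np.sum(mu**2) - k - np.sum(np.log(sigma2)))

    print(kl_q_p(np.zeros(5), np.ones(5)))        # 0.0: Q already equals the prior
    print(kl_q_p(np.ones(5), 0.5 * np.ones(5)))   # ~2.98 > 0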

First term of the RHS. The first term is maximized with stochastic gradient descent. To approximate the expectation, take one sample ẑ from Q(z | X) and use E_{Q(z|X)}[log P(X | z)] ≈ log P(X | ẑ). The overall objective over the dataset D is
\[ \mathbb{E}_{X \sim D}\big[ \log P(X) - \mathrm{KL}(Q(z \mid X) \,\|\, P(z \mid X)) \big] = \mathbb{E}_{X \sim D}\big[ \mathbb{E}_{z \sim Q(z \mid X)}[\log P(X \mid z)] - \mathrm{KL}(Q(z \mid X) \,\|\, P(z)) \big]. \quad (2.3) \]
To use SGD, sample one value of X and one value of z, then compute the gradient of the RHS by backpropagation. Repeating this m times and averaging yields a result that converges to the gradient of the RHS.

Figure: Flow chart for the VAE algorithm.

Significant problems. The algorithm seems perfect, but there are two significant problems in the calculation. First, the gradient of the first term of the RHS in Equation (2.3) should involve the parameters of both P and Q, but with our sampling method the parameters of Q are left out, so we cannot obtain the true gradient. Second, the algorithm is split into two parts: the first half trains the model Q(z | X) on the given data X, and the second half trains the model f on the newly sampled z. Backpropagation cannot pass through this discontinuous sampling step, so the algorithm fails. The sketch below illustrates the issue.
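A concrete illustration using PyTorch's distributions API (the framework choice is mine, not the slides'): a plain draw from Q(z | X) severs the computation graph between the encoder outputs and z, while a reparameterized draw keeps it.

    import torch
    from torch.distributions import Normal

    mu = torch.zeros(3, requires_grad=True)
    sigma = torch.ones(3, requires_grad=True)
    q = Normal(mu, sigma)

    z_plain = q.sample()     # drawn under no_grad: the graph back to mu, sigma is cut
    z_reparam = q.rsample()  # reparameterized draw: gradients can flow to mu, sigma

    print(z_plain.requires_grad)    # False -> backprop cannot reach Q's parameters
    print(z_reparam.requires_grad)  # True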

Modification by the reparameterization trick. To solve the first problem, we need to change the way we sample. We first sample ε ~ N(0, I) and then define z = µ(X) + Σ(X)^{1/2} ε. This is just an equivalent representation of the sample z from the previous algorithm, but the objective becomes
\[ \mathbb{E}_{X \sim D}\Big[ \mathbb{E}_{\varepsilon \sim N(0, I)}\big[\log P\big(X \mid \mu(X) + \Sigma(X)^{1/2}\varepsilon\big)\big] - \mathrm{KL}(Q(z \mid X) \,\|\, P(z)) \Big]. \]
This time the sampling step does not involve the functions we are optimizing: we sample from Q(z | X) by evaluating a function h(ε, X), where ε is unobserved noise and h is continuous in X (a discrete Q(z | X) fails in this case). Backpropagation can then be carried out successfully.
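A minimal PyTorch sketch of the reparameterized draw and the gradient it now exposes (diagonal Σ; the names are mine):

    import torch

    mu = torch.tensor([0.5, -1.0], requires_grad=True)
    log_sigma2 = torch.zeros(2, requires_grad=True)   # parameterizes the diagonal of Sigma(X)

    eps = torch.randn(2)                              # eps ~ N(0, I), involves no parameters
    z = mu + torch.exp(0.5 * log_sigma2) * eps        # z = mu(X) + Sigma(X)^{1/2} eps

    z.sum().backward()                                # a downstream scalar loss would sit here
    print(mu.grad, log_sigma2.grad)                   # gradients now reach the encoder outputs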

Figure: Flow chart for the corrected VAE algorithm.
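Putting the pieces together, here is a compact end-to-end sketch of one VAE training step in PyTorch. The layer sizes, the unit-variance Gaussian decoder (so the reconstruction term is a squared error up to a constant), and all names are illustrative choices of mine, not taken from the slides.

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, x_dim=784, z_dim=20, h_dim=400):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.enc_mu = nn.Linear(h_dim, z_dim)         # mu(X)
            self.enc_logvar = nn.Linear(h_dim, z_dim)     # log of the diagonal of Sigma(X)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))   # f(z; theta)

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            eps = torch.randn_like(mu)
            z = mu + torch.exp(0.5 * logvar) * eps        # reparameterization trick
            return self.dec(z), mu, logvar

    def negative_elbo(x, x_hat, mu, logvar):
        # Reconstruction term: -log P(X | z) for a unit-variance Gaussian decoder (up to a constant).
        recon = 0.5 * ((x - x_hat) ** 2).sum(dim=1)
        # KL( Q(z|X) || N(0, I) ) in the closed form derived above.
        kl = 0.5 * (logvar.exp().sum(dim=1) + (mu ** 2).sum(dim=1)
                    - mu.shape[1] - logvar.sum(dim=1))
        return (recon + kl).mean()

    model = VAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(32, 784)                               # stand-in minibatch
    x_hat, mu, logvar = model(x)
    loss = negative_elbo(x, x_hat, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item())                                    # one SGD step on the negative ELBO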

Verification. For the decoder: we can simply sample a random variable z ~ N(0, I) and feed it into the decoder f to generate a new sample. The probability P(X) for a test example X is not tractable, because P is implicit. However, according to Equation (2.2), since the KL divergence is non-negative, the RHS is a lower bound on log P(X), known as the Evidence Lower BOund (ELBO). This lower bound is a useful tool for getting a rough idea of how well our model captures a particular data point X, because it converges quickly.

Remarks (details not presented here). Interpretation of the RHS: the two terms have information-theoretic meanings. Separating the RHS by sample: a regularization term can be obtained by a transformation of the RHS. Sampling for Q(z | X): the original paper² expresses this distribution as g(X, ε), where ε is drawn independently from some p(ε); a restriction on p is needed.
² D. P. Kingma and M. Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Section 3 Extensions of VAEs

Comparison versus GANs. Both are recent deep generative models. The biggest advantage of VAEs is the clean probabilistic formulation they come with as a result of maximizing a lower bound on the log-likelihood. VAEs are also usually easier to train and get working: relatively easy to implement and robust to hyperparameter choices. GANs are better at generating visual features; the output of VAEs is sometimes blurry. More detailed discussions can be found on Reddit.

Conditional Variational Autoencoders. Original problem: given an input dataset X and outputs Y, we want to build a model P(Y | X) that maximizes the probability of the ground-truth distribution. Example: generating handwritten digits. We want to add digits to an existing string of digits written by a single person. A standard regression model fails in this situation, because it ends up generating an average image that minimizes the distance, which may look like a meaningless blur. CVAEs, however, allow us to tackle problems where the input-to-output mapping is one-to-many, without requiring us to explicitly specify the structure of the output distribution.

Conditional Variational Autoencoders Figure: Flow chart for the CVAE algorithm.

Conditional Variational Autoencoders.
\[ P(Y \mid X) = N\big(f(z, X), \sigma^2 I\big); \]
\[ \log P(Y \mid X) - \mathrm{KL}(Q(z \mid Y, X) \,\|\, P(z \mid Y, X)) = \mathbb{E}_{Q(z \mid Y, X)}[\log P(Y \mid z, X)] - \mathrm{KL}(Q(z \mid Y, X) \,\|\, P(z \mid X)). \]
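In practice the conditioning is often implemented by simply concatenating X to the inputs of both the encoder and the decoder. A PyTorch sketch of that idea (layer sizes and names are my own, not from the slides; the prior P(z | X) and the loss are omitted):

    import torch
    import torch.nn as nn

    x_dim, y_dim, z_dim, h_dim = 10, 784, 20, 400

    # Encoder Q(z | Y, X): takes the pair (Y, X), outputs mu and log-variance.
    enc = nn.Sequential(nn.Linear(y_dim + x_dim, h_dim), nn.ReLU(),
                        nn.Linear(h_dim, 2 * z_dim))
    # Decoder for P(Y | z, X): takes (z, X), outputs the mean f(z, X).
    dec = nn.Sequential(nn.Linear(z_dim + x_dim, h_dim), nn.ReLU(),
                        nn.Linear(h_dim, y_dim))

    x, y = torch.rand(32, x_dim), torch.rand(32, y_dim)
    mu, logvar = enc(torch.cat([y, x], dim=1)).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized draw
    y_hat = dec(torch.cat([z, x], dim=1))                     # f(z, X)
    print(y_hat.shape)                                        # torch.Size([32, 784])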

VAE-GAN³. Combine a VAE with a GAN by collapsing the decoder and the generator into one, since both map from the standard Gaussian distribution to X. Figure: Overview of the VAE-GAN algorithm.
³ A. B. L. Larsen et al. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

Instead of measuring the error element-wise, VAE-GAN measures it feature-wise, where the features are produced by the discriminator. The parameters of the generator and the decoder are shared, and three kinds of errors are optimized simultaneously; a sketch of the three terms follows. Figure: Flow of the VAE-GAN algorithm; grey arrows represent the terms in the training objective.
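A rough, self-contained PyTorch sketch of the three training terms described in Larsen et al. (2015). The toy architectures and all names are mine, and the scheme for weighting the terms and deciding which networks each term updates is omitted; see the paper for those details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x_dim, z_dim, h_dim = 784, 20, 256

    # Toy stand-ins for the three networks (architectures are illustrative only).
    enc = nn.Linear(x_dim, 2 * z_dim)                        # outputs [mu, logvar]
    dec = nn.Sequential(nn.Linear(z_dim, x_dim), nn.Sigmoid())
    disc_features = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())  # a "Dis_l" feature layer
    disc_head = nn.Sequential(nn.Linear(h_dim, 1), nn.Sigmoid())
    disc = lambda v: disc_head(disc_features(v))

    x = torch.rand(32, x_dim)

    # 1) Prior term: KL( Q(z|X) || N(0, I) ), exactly as in the plain VAE.
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    l_prior = 0.5 * (logvar.exp() + mu**2 - 1 - logvar).sum(dim=1).mean()

    # 2) Reconstruction term, measured feature-wise in a discriminator layer
    #    rather than pixel-wise.
    x_rec = dec(z)
    l_recon = F.mse_loss(disc_features(x_rec), disc_features(x))

    # 3) GAN term: real data vs. reconstructions vs. samples decoded from the prior.
    x_prior = dec(torch.randn_like(z))
    ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)
    l_gan = (F.binary_cross_entropy(disc(x), ones)
             + F.binary_cross_entropy(disc(x_rec), zeros)
             + F.binary_cross_entropy(disc(x_prior), zeros))

    print(l_prior.item(), l_recon.item(), l_gan.item())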

That's all. Thanks!