September 26 & October 3, 2017
Section 1 Preliminaries
Kullback-Leibler divergence

KL divergence (continuous case): let $p(x)$ and $q(x)$ be two probability density functions. Then the KL divergence is defined as
$$KL(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{1.1}$$
By Jensen's inequality, $KL(p \,\|\, q) \ge 0$, and equality holds if and only if $p = q$ almost everywhere.
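Special case: multivariate Gaussian distribution. Suppose $p_1 = N(\mu_1, \Sigma_1)$ and $p_2 = N(\mu_2, \Sigma_2)$ are distributions of a $k$-dimensional variable. Then
$$KL(p_1 \,\|\, p_2) = \frac{1}{2}\left[\log\frac{\det(\Sigma_2)}{\det(\Sigma_1)} - k + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)\right].$$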
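As a quick numerical check of the Gaussian closed form, here is a minimal NumPy sketch. The function name `gaussian_kl` and the example means and covariances are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1) || N(mu2, sigma2)) for full covariance matrices."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    k = mu1.shape[0]
    diff = mu2 - mu1
    # slogdet returns (sign, log|det|); it is more stable than log(det(...)).
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    # Solve linear systems instead of forming the inverse of sigma2 explicitly.
    trace_term = np.trace(np.linalg.solve(sigma2, sigma1))
    quad_term = diff @ np.linalg.solve(sigma2, diff)
    return 0.5 * (logdet2 - logdet1 - k + trace_term + quad_term)

# Illustrative 2-dimensional example (values chosen arbitrarily).
mu_a, cov_a = np.zeros(2), np.eye(2)
mu_b, cov_b = np.array([1.0, 0.0]), 2.0 * np.eye(2)

kl_same = gaussian_kl(mu_a, cov_a, mu_a, cov_a)  # identical distributions
kl_ab = gaussian_kl(mu_a, cov_a, mu_b, cov_b)
kl_ba = gaussian_kl(mu_b, cov_b, mu_a, cov_a)
print(kl_same, kl_ab, kl_ba)
```

The divergence vanishes for identical distributions and is not symmetric in its two arguments, so `kl_ab` and `kl_ba` differ.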
Variational Inference

Suppose we want to use $Q(z)$ to approximate $P(z|X)$, where $P(z|X)$ does not have an explicit representation. A good approximation would then try to minimize
$$KL(Q(z) \,\|\, P(z|X)) = \int Q(z) \log \frac{Q(z)}{P(z|X)} \, dz.$$
By Bayes' formula, the above equation can be rewritten as
$$\log P(X) - KL(Q(z) \,\|\, P(z|X)) = \int Q(z) \log P(X|z) \, dz - KL(Q(z) \,\|\, P(z)). \tag{1.2}$$
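The step from Bayes' formula to Equation (1.2) can be written out as follows. This is a sketch of the intermediate algebra, using $P(z|X) = P(X|z)P(z)/P(X)$ and $\int Q(z)\,dz = 1$:

```latex
\begin{align*}
KL(Q(z) \,\|\, P(z|X))
  &= \int Q(z) \bigl[\log Q(z) - \log P(z|X)\bigr] \, dz \\
  &= \int Q(z) \bigl[\log Q(z) - \log P(X|z) - \log P(z) + \log P(X)\bigr] \, dz \\
  &= \log P(X) - \int Q(z) \log P(X|z) \, dz + KL(Q(z) \,\|\, P(z)).
\end{align*}
```

Moving $KL(Q(z) \,\|\, P(z|X))$ and the two right-hand terms to opposite sides gives Equation (1.2).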
Section 2 Variational Autoencoders [1]
[1] A. B. L. Larsen et al. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
Original problem

Given a dataset X drawn from a distribution $P(x)$, we want to generate new data that follows the same unknown distribution $P(x)$.
We construct a model $f(z;\theta): \mathcal{Z} \to \mathcal{X}$, where $\mathcal{X}$ is the space of observed variables (data), $\mathcal{Z}$ the space of latent variables, $\Theta$ the parameter space, and $f$ a complex but deterministic mapping.
Latent variables: variables that are not directly observed but are inferred from other, directly observed variables.
Given z, we can generate a sample X by $f(z;\theta)$. We wish to optimize $\theta$ such that we can sample z from $P(z)$ and, with high probability, $f(z;\theta)$ will be like the $x$'s in our dataset.
Likelihood

$$P(X;\theta) = \int P(X|z;\theta) \, P(z) \, dz \tag{2.1}$$
Choose $\theta$ to maximize the above integral.
In VAEs, $P(X|z;\theta) = N(f(z;\theta), \sigma^2 I)$ in the continuous case, and $P(X|z;\theta) = B(f(z;\theta))$ (Bernoulli) in the discrete case. In both cases, $P(X|z;\theta)$ is continuous with respect to $\theta$, so we can use gradient ascent to maximize it.
Questions:
How do we define the latent variable z so that it captures latent information?
How do we deal with the integral over z, and its gradient with respect to $\theta$?
Define latent variable

We want the latent variable to satisfy two properties:
(1) The latent variables are chosen automatically, because we do not know much about the intrinsic properties of X.
(2) Different components of z are mutually independent, to avoid overlap in the latent information.
VAEs assert that the latent variable can be drawn from a standard Gaussian distribution, $N(0, I)$.
Assertion: Any distribution in $d$ dimensions can be generated by taking a set of $d$ normally distributed variables and mapping them through a sufficiently complicated function.
Since $f(z;\theta)$ is complicated enough (it is a trained neural network), this choice of latent variable will not matter too much.
Deal with the integral

$$P(X;\theta) \approx \frac{1}{n} \sum_i P(X \mid z^{(i)};\theta), \qquad z^{(i)} \sim N(0, I).$$
Figure: Counterexample.
We need to set $\sigma$ very small, which in turn requires a very large set of samples. In this case, we need a faster sampling procedure for z.
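To see why the naive estimate above becomes expensive when $\sigma$ is small, here is a minimal NumPy sketch on a toy model. The one-dimensional linear decoder $f(z;\theta) = \theta z$, the values of `x`, `theta` and `n`, and the helper names are all assumptions made for illustration; for this toy model the exact marginal is $N(0, \theta^2 + \sigma^2)$, so the estimate can be checked.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def naive_likelihood(x, theta, sigma, n, rng):
    """Monte Carlo estimate of P(x; theta) and its relative standard error."""
    z = rng.standard_normal(n)                 # z_i ~ N(0, 1)
    weights = normal_pdf(x, theta * z, sigma)  # P(x | z_i; theta) with f(z) = theta * z
    estimate = weights.mean()
    rel_se = weights.std(ddof=1) / np.sqrt(n) / estimate
    return estimate, rel_se

rng = np.random.default_rng(0)
x, theta, n = 1.0, 1.0, 200_000

est_wide, rel_se_wide = naive_likelihood(x, theta, 0.5, n, rng)
est_narrow, rel_se_narrow = naive_likelihood(x, theta, 0.01, n, rng)

# Exact marginal of the toy model: x ~ N(0, theta^2 + sigma^2).
exact_wide = normal_pdf(x, 0.0, np.sqrt(theta**2 + 0.5**2))

print("sigma=0.5 :", est_wide, "exact:", exact_wide, "rel. std. error:", rel_se_wide)
print("sigma=0.01:", est_narrow, "rel. std. error:", rel_se_narrow)
```

With the same number of samples, the relative standard error is noticeably larger for the small $\sigma$, because only the few $z^{(i)}$ whose decoded value falls very close to X contribute to the average.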
Deal with the integral: sampling in VAEs

The key idea behind the variational autoencoder is to sample values of z that are likely to have produced X, and to compute $P(X)$ just from those.
New function $Q(z)$: a distribution over z values that are likely to produce X. Then $E_{z \sim P(z)}[P(X|z)] \to E_{z \sim Q(z)}[P(X|z)]$.
$P(z|X)$ is the optimal choice of $Q(z)$, but $P(z|X)$ is intractable.
Aim: find a $Q(z)$ that approximates $P(z|X)$ while being simple enough.
Recall: Variational Inference

For any $Q(z)$, use $Q(z)$ to approximate $P(z|X)$. According to Equation (1.2),
$$\log P(X) - KL(Q(z) \,\|\, P(z|X)) = E_{z \sim Q(z)}[\log P(X|z)] - KL(Q(z) \,\|\, P(z)).$$
Since we're interested in inferring $P(X)$, it makes sense to construct a Q that depends on X:
$$\log P(X) - KL(Q(z|X) \,\|\, P(z|X)) = E_{z \sim Q(z|X)}[\log P(X|z)] - KL(Q(z|X) \,\|\, P(z)). \tag{2.2}$$
Aim: maximize $\log P(X)$ (w.r.t. $\theta$) and minimize $KL(Q(z|X) \,\|\, P(z|X))$ $\Leftrightarrow$ maximize the LHS $\Leftrightarrow$ maximize the RHS.
Second term of RHS

Aim: minimize $KL(Q(z|X) \,\|\, P(z))$.
We already have $P(z) = N(0, I)$. The usual choice is to define $Q(z|X) = N(\mu(X), \Sigma(X))$, where $\mu$ and $\Sigma$ are deterministic functions of X with their own learnable parameters (omitted from the notation here and in the following equations). In addition, we constrain $\Sigma$ to be a diagonal matrix.
Minimization: according to the earlier formula for the KL divergence between multivariate Gaussian distributions,
$$KL(Q(z|X) \,\|\, P(z)) = \frac{1}{2}\left(\operatorname{tr}(\Sigma(X)) + \mu(X)^\top \mu(X) - k - \log\det(\Sigma(X))\right).$$
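For a diagonal $\Sigma(X)$ the trace and log-determinant reduce to sums over the diagonal entries, so the KL term is cheap to evaluate. A minimal NumPy sketch follows; the function name and the example inputs are illustrative assumptions, and the encoder outputs are passed in as a mean vector and a vector of variances.

```python
import numpy as np

def kl_diag_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), with mu and var given as vectors."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    k = mu.shape[-1]
    # For diagonal Sigma: tr(Sigma) = sum(var) and log det(Sigma) = sum(log(var)).
    return 0.5 * (var.sum(axis=-1) + (mu ** 2).sum(axis=-1) - k - np.log(var).sum(axis=-1))

kl_zero = kl_diag_to_standard_normal([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])  # Q equals the prior
kl_shifted = kl_diag_to_standard_normal([1.0, -2.0], [0.5, 2.0])
print(kl_zero, kl_shifted)
```

The term is zero exactly when $Q(z|X)$ coincides with the prior and grows as the mean moves away from the origin or the variances move away from one.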
First term of RHS

The first term is maximized with SGD. To approximate the expectation, take a single sample $\hat{z}$ from $Q(z|X)$ and use
$$E_{z \sim Q(z|X)}[\log P(X|z)] \approx \log P(X|\hat{z}).$$
General objective to maximize:
$$E_{X \sim D}\left[\log P(X) - KL(Q(z|X) \,\|\, P(z|X))\right] = E_{X \sim D}\left[E_{z \sim Q(z|X)}[\log P(X|z)] - KL(Q(z|X) \,\|\, P(z))\right]. \tag{2.3}$$
To use SGD, sample a value X and a value z, then compute the gradient of the RHS by backpropagation. Do this $m$ times and take the average; the result converges to the gradient of the RHS.
Figure: Flow chart for the VAE algorithm.
Significant Problems

The algorithm seems perfect, but there are two significant problems in the calculation:
(1) The gradient of the first term of the RHS in Equation (2.3) should involve the parameters of both P and Q, but our sampling method drops the dependence on the parameters of Q. In this case, we cannot compute the true gradient.
(2) The algorithm is separated into two parts: the first half trains the model $Q(z|X)$ on the given data X, and the second half trains the model $f$ on the newly sampled z. Backpropagation cannot pass through this discontinuous sampling step, so the algorithm fails.
Modification by Reparameterization Trick

To solve the first problem, we need to change the way of sampling. We first sample $\epsilon \sim N(0, I)$, then define $z = \Sigma(X)^{1/2} \epsilon + \mu(X)$.
This is an equivalent representation of the sample z in the previous algorithm, but the objective now becomes
$$E_{X \sim D}\left[E_{\epsilon \sim N(0, I)}\left[\log P\left(X \mid z = \mu(X) + \Sigma(X)^{1/2} \epsilon\right)\right] - KL(Q(z|X) \,\|\, P(z))\right].$$
This time the distribution we sample from does not depend on the functions being optimized.
Sample from $Q(z|X)$ by evaluating a function $h(\epsilon, X)$, where $\epsilon$ is unobserved noise and $h$ is continuous in X. (A discrete $Q(z|X)$ fails in this case.) Backpropagation can then be carried out successfully.
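The effect of the reparameterization can be illustrated in one dimension. The sketch below is a toy example under stated assumptions: `mu` and `sigma` stand in for $\mu(X)$ and $\Sigma(X)^{1/2}$, and the hypothetical objective $g(z) = z^2$ replaces $\log P(X|z)$ so that the exact gradients, $2\mu$ and $2\sigma$, are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8  # stand-ins for mu(X) and Sigma(X)^{1/2} in one dimension
n = 400_000

eps = rng.standard_normal(n)  # noise drawn independently of mu and sigma
z = mu + sigma * eps          # reparameterized sample, distributed as N(mu, sigma^2)

# Hypothetical objective g(z) = z^2, so E[g(z)] = mu^2 + sigma^2 in closed form.
# Because z is a differentiable function of (mu, sigma), the chain rule applies
# sample by sample: dg/dmu = 2z * dz/dmu = 2z, dg/dsigma = 2z * dz/dsigma = 2z * eps.
grad_mu = np.mean(2.0 * z)
grad_sigma = np.mean(2.0 * z * eps)

print("d/dmu   :", grad_mu, "analytic:", 2.0 * mu)
print("d/dsigma:", grad_sigma, "analytic:", 2.0 * sigma)
```

The randomness enters only through `eps`, which does not depend on the parameters, so the gradient with respect to `mu` and `sigma` can be taken inside the expectation. In a real VAE the same chain rule is applied by backpropagation through the encoder outputs.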
Figure: Flow chart for the corrected VAE algorithm.
Verification

For the decoder: we can simply sample a random variable $z \sim N(0, I)$ and feed it into the decoder to obtain $f(z;\theta)$.
The probability $P(X)$ for a test example X: this is not tractable, because P is implicit. However, according to Equation (2.2), since the KL divergence is non-negative, the right-hand side gives a lower bound on $\log P(X)$, called the evidence lower bound (ELBO).
This lower bound is a useful tool for getting a rough idea of how well the model captures a particular data point X, because its estimate converges quickly.
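The lower-bound property can be checked on a toy model where everything is available in closed form. The sketch below assumes a one-dimensional linear-Gaussian model, $z \sim N(0, 1)$ and $X|z \sim N(\theta z, \sigma^2)$, which is not the neural-network decoder of the slides; the function names and numerical values are illustrative.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo(x, theta, sigma, q_mean, q_var):
    """E_Q[log P(x|z)] - KL(Q || N(0, 1)) for Q = N(q_mean, q_var), in closed form."""
    # E_Q[(x - theta*z)^2] = (x - theta*q_mean)^2 + theta^2 * q_var
    expected_loglik = (-0.5 * np.log(2.0 * np.pi * sigma**2)
                       - ((x - theta * q_mean) ** 2 + theta**2 * q_var) / (2.0 * sigma**2))
    kl_to_prior = 0.5 * (q_var + q_mean**2 - 1.0 - np.log(q_var))
    return expected_loglik - kl_to_prior

x, theta, sigma = 1.0, 1.0, 0.5
log_px = log_normal_pdf(x, 0.0, theta**2 + sigma**2)  # exact log P(x) for this toy model

# Exact posterior P(z|x) of the linear-Gaussian model.
post_var = sigma**2 / (sigma**2 + theta**2)
post_mean = theta * x / (sigma**2 + theta**2)

elbo_crude = elbo(x, theta, sigma, q_mean=0.0, q_var=1.0)            # Q equal to the prior
elbo_exact = elbo(x, theta, sigma, q_mean=post_mean, q_var=post_var)  # Q equal to P(z|x)
print("log P(x):", log_px, "ELBO (Q = prior):", elbo_crude, "ELBO (Q = posterior):", elbo_exact)
```

The bound is loose when Q is far from the true posterior and becomes tight when $Q(z|X) = P(z|X)$, in line with Equation (2.2), where the gap is exactly $KL(Q(z|X) \,\|\, P(z|X))$.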
Remarks

Detailed remarks are not presented here.
Interpretation of the RHS. The two terms have their meanings in information theory.
Separate the RHS by sample.
Regularization term. It can be found by some transformation of the RHS.
Sampling for $Q(z|X)$. The original paper [2] expresses this distribution with $g(x, \epsilon)$, where $\epsilon \sim p$ independently. A restriction on $p$ is needed.
[2] D. P. Kingma and M. Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Section 3 Extensions of VAEs
Comparison Versus GAN

Both are recent deep generative models.
The biggest advantage of VAEs is the nice probabilistic formulation they come with, a result of maximizing a lower bound on the log-likelihood.
VAEs are also usually easier to train and get working: they are relatively easy to implement and robust to hyperparameter choices.
GANs are better at generating visual features. The output of VAEs is sometimes blurry.
More detailed discussions can be found on Reddit.
Conditional Variational Autoencoders

Original problem: given an input dataset X and outputs Y, we want to create a model $P(Y|X)$ that maximizes the probability of the ground-truth distribution.
Example: generating handwritten digits. We want to add digits to an existing string of digits written by a single person.
A standard regression model will fail in this situation, because it will end up generating an average image that minimizes the distance, which may look like a meaningless blur.
CVAEs, however, allow us to tackle problems where the input-to-output mapping is one-to-many, without requiring us to explicitly specify the structure of the output distribution.
Figure: Flow chart for the CVAE algorithm.
$$P(Y|X) = N(f(z, X), \sigma^2 I);$$
$$\log P(Y|X) - KL(Q(z|Y,X) \,\|\, P(z|Y,X)) = E_{z \sim Q(z|Y,X)}[\log P(Y|z,X)] - KL(Q(z|Y,X) \,\|\, P(z|X)).$$
VAE-GAN [3]

Combine a VAE with a GAN by collapsing the decoder and the generator into one, since both map from a standard Gaussian distribution to X.
Figure: Overview of the VAE-GAN algorithm.
[3] A. B. L. Larsen et al. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
Instead of measuring the error element-wise, VAE-GAN measures the error feature-wise, where the features are produced by the discriminator.
The generator and the decoder share their parameters.
Three kinds of errors are optimized simultaneously.
Figure: Flow of the VAE-GAN algorithm. Grey arrows represent the terms in the training objective.
That's all. Thanks!