TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Variational Autoencoders

1 TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2018
Variational Autoencoders

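2 The Latent Variable Cross-Entropy Objective

We will now drop the negation and switch to argmax.

$$\Phi^* = \operatorname*{argmax}_{\Phi} \; E_{y \sim \mathrm{Pop}} \ln Q_\Phi(y)$$

$$Q_\Phi(y) = \sum_{\hat z} Q_\Phi(\hat z, y)$$

EG Identity:

$$\nabla_\Phi \ln Q_\Phi(y) = E_{\hat z \sim Q_\Phi(\hat z \mid y)} \nabla_\Phi \ln Q_\Phi(\hat z, y)$$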
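
The EG identity can be checked numerically on a tiny discrete model. The sketch below is only an illustration under assumptions not in the slides: Q_Φ(ẑ, y) is taken to be a softmax over a table of logits, and the sizes `n_z` and `n_y` are arbitrary. It compares a finite-difference gradient of ln Q_Φ(y) with the posterior average of the joint gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n_z, n_y = 3, 4                      # hypothetical sizes for the toy model
phi = rng.normal(size=(n_z, n_y))    # parameters: one logit per (z, y) pair


def log_joint(phi):
    """ln Q_phi(z, y) as a table: a softmax over all (z, y) pairs."""
    return phi - np.log(np.sum(np.exp(phi)))


def log_marginal(phi, y):
    """ln Q_phi(y) = ln sum_z Q_phi(z, y)."""
    return np.log(np.sum(np.exp(log_joint(phi)[:, y])))


def grad_log_joint(phi, z, y):
    """Gradient of ln Q_phi(z, y) with respect to phi (softmax gradient)."""
    g = -np.exp(log_joint(phi))
    g[z, y] += 1.0
    return g


y = 2
# Right-hand side: average the joint gradient under the posterior Q_phi(z | y).
joint_col = np.exp(log_joint(phi)[:, y])
posterior = joint_col / joint_col.sum()
rhs = sum(posterior[z] * grad_log_joint(phi, z, y) for z in range(n_z))

# Left-hand side: central finite differences of ln Q_phi(y).
lhs = np.zeros_like(phi)
eps = 1e-6
for idx in np.ndindex(*phi.shape):
    step = np.zeros_like(phi)
    step[idx] = eps
    lhs[idx] = (log_marginal(phi + step, y) - log_marginal(phi - step, y)) / (2 * eps)

max_gap = np.max(np.abs(lhs - rhs))  # should be at the level of finite-difference error
```

Here the posterior can be enumerated because ẑ takes only three values; the point of the rest of the lecture is the case where it cannot.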

3 Variational Autoencoders

$$\nabla_\Phi \ln Q_\Phi(y) = E_{\hat z \sim Q_\Phi(\hat z \mid y)} \nabla_\Phi \ln Q_\Phi(\hat z, y)$$

Except for directed tree models, this gradient must be approximated: exact computation is #P-hard.

Variational autoencoders approximate $Q_\Phi(\hat z \mid y)$ with a model supporting easy sampling of $\hat z$.

4 Generative Models

A model for which sampling is easy will be called generative.

In variational autoencoders we assume that $Q_\Phi(y \mid \hat z)$ is generative but that $Q_\Phi(\hat z \mid y)$ is not, that is, that sampling from $Q_\Phi(\hat z \mid y)$ is hard.

We approximate $Q_\Phi(\hat z \mid y)$ with a generative model.

5 Generation Replaces Search

"Generation replaces search" can be viewed as a general principle of deep learning.

Rather than search for a $\hat z$ that generates $y$, we strive to directly calculate (to generate) a $\hat z$ that generates $y$.

"Generation replaces search" is exemplified in current parsing architectures.

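6 Variational Autoencoders

$$\nabla_\Phi \ln Q_\Phi(y) = E_{\hat z \sim Q_\Phi(\hat z \mid y)} \nabla_\Phi \ln Q_\Phi(\hat z, y)$$

$$\Phi^*, \Psi^* = \operatorname*{argmax}_{\Phi, \Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \right]$$

Here $P_\Psi(\hat z \mid y)$ is a generative approximation of $Q_\Phi(\hat z \mid y)$.

The quantity being optimized is called the evidence lower bound (ELBO).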

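7 Variational Autoencoders

$$\nabla_\Phi \ln Q_\Phi(y) = E_{\hat z \sim Q_\Phi(\hat z \mid y)} \nabla_\Phi \ln Q_\Phi(\hat z, y)$$

$$\begin{aligned}
\Phi^*, \Psi^* &= \operatorname*{argmax}_{\Phi, \Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \right] \\
&= \operatorname*{argmax}_{\Phi, \Psi} \; E_{y \sim \mathrm{Pop}} \left[ \ln Q_\Phi(y) - KL(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y)) \right]
\end{aligned}$$

The equivalence of the two ELBO expressions is proved below.

The first expression supports SGD training through sampling.

The second expression establishes that the ELBO is a lower bound on the evidence $\ln Q_\Phi(y)$ and that $P_\Psi(\hat z \mid y)$ should approximate $Q_\Phi(\hat z \mid y)$.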

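8 Derivation of Equivalence I

$$\begin{aligned}
E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y)
&= E_{\hat z \sim P_\Psi(\hat z \mid y)} \left( \ln Q_\Phi(y) + \ln Q_\Phi(\hat z \mid y) \right) \\
&= \ln Q_\Phi(y) + E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z \mid y) \\
&= \ln Q_\Phi(y) - H(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y))
\end{aligned}$$

Here $H(P, Q) = -E_{\hat z \sim P} \ln Q(\hat z)$ is the cross-entropy.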

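9 Derivation of Equivalence II

$$\begin{aligned}
& E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \\
&\quad = \ln Q_\Phi(y) - H(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y)) + H(P_\Psi(\hat z \mid y)) \\
&\quad = \ln Q_\Phi(y) - KL(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y))
\end{aligned}$$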
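
The equivalence of the two ELBO forms can be confirmed numerically for a discrete ẑ. The following is a minimal sketch under illustrative assumptions: the joint values Q_Φ(ẑ, y) for one fixed y and the approximating distribution P_Ψ(ẑ | y) are arbitrary random tables, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_z = 5                                        # hypothetical number of latent values

# Q(z, y) for one fixed y: positive entries whose sum is the evidence Q(y) < 1.
joint_y = 0.02 * (rng.random(n_z) + 0.1)
# An arbitrary approximating distribution P(z | y).
p = rng.random(n_z) + 0.1
p /= p.sum()

log_evidence = np.log(joint_y.sum())           # ln Q(y)
posterior = joint_y / joint_y.sum()            # Q(z | y)

# Form 1: E_{z ~ P} ln Q(z, y) + H(P)
elbo_sampling_form = np.sum(p * np.log(joint_y)) - np.sum(p * np.log(p))

# Form 2: ln Q(y) - KL(P, Q(z | y))
kl = np.sum(p * (np.log(p) - np.log(posterior)))
elbo_bound_form = log_evidence - kl

gap = abs(elbo_sampling_form - elbo_bound_form)  # zero up to rounding
```

Because the KL term is non-negative, the same numbers also show the bound: `elbo_sampling_form` never exceeds `log_evidence`, with equality only when `p` equals `posterior`.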

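10 EM is Alternating Optimization of the ELBO

$$E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \tag{1}$$

$$= \ln Q_\Phi(y) - KL(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y)) \tag{2}$$

By (2):

$$\Psi^* = \operatorname*{argmin}_{\Psi} \; E_{y \sim \mathrm{Pop}} \, KL(P_\Psi(\hat z \mid y), Q_\Phi(\hat z \mid y))$$

By (1):

$$\Phi^* = \operatorname*{argmax}_{\Phi} \; E_{y \sim \mathrm{Pop}} \, E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y)$$

EM:

$$\Phi^{t+1} = \operatorname*{argmax}_{\Phi} \; E_{y \sim \mathrm{Pop}} \, E_{\hat z \sim Q_{\Phi^t}(\hat z \mid y)} \log Q_\Phi(\hat z, y)$$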
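
As a concrete instance of this alternation, here is a small sketch of EM on a two-component Gaussian mixture. The model is an assumption chosen for illustration (equal mixing weights, unit variances, only the means learned); the slides do not specify one. The E-step sets P(ẑ | y) to the exact posterior under the current parameters, and the M-step maximizes the expected joint log-likelihood in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: y drawn from two unit-variance Gaussians (assumed for illustration).
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])


def log_joint(mu, y):
    """ln Q_mu(z, y) for z in {0, 1}: equal mixing weights, unit variances."""
    return np.log(0.5) - 0.5 * np.log(2 * np.pi) - 0.5 * (y[:, None] - mu[None, :]) ** 2


def log_evidence(mu, y):
    """Average over the data of ln Q_mu(y) = ln sum_z Q_mu(z, y)."""
    lj = log_joint(mu, y)
    m = lj.max(axis=1, keepdims=True)
    return np.mean(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1)))


mu = np.array([-0.5, 0.5])           # initial parameters Phi_0
history = [log_evidence(mu, y)]
for _ in range(20):
    # E-step: P(z | y) = Q_{Phi_t}(z | y), which makes the KL term in (2) zero.
    lj = log_joint(mu, y)
    post = np.exp(lj - lj.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: maximise E_{z ~ post} ln Q_Phi(z, y) over the means (weighted averages).
    mu = (post * y[:, None]).sum(axis=0) / post.sum(axis=0)
    history.append(log_evidence(mu, y))
```

The list `history` records the average evidence after each round; under EM it should never decrease.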

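11 The Reparameterization Trick

$$\Psi^* = \operatorname*{argmax}_{\Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \right]$$

How do we differentiate the sampling?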

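12 The Reparameterization Trick

$$\Psi^* = \operatorname*{argmax}_{\Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \right]$$

We note that in practice all sampling is computed by a deterministic function of (pseudo) random numbers. We can make this explicit.

Model $P_\Psi(\hat z \mid y)$ by

$$\epsilon \sim \mathrm{noise}, \qquad \hat z = \hat z_\Psi(y, \epsilon)$$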
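
To see why making the noise explicit helps, consider a scalar toy case. This is a sketch under assumptions of my own choosing: the sample is ẑ = μ + σε with ε standard normal, and the function being averaged is f(ẑ) = -(ẑ - 1)^2, picked only because its expectation has a closed form. Since ε does not depend on the parameters, the gradient moves inside the expectation and the chain rule applies to each sample.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8                 # hypothetical values of the parameters Psi


def f_prime(z):
    """Derivative of the toy objective f(z) = -(z - 1)^2."""
    return -2.0 * (z - 1.0)


eps = rng.standard_normal(200_000)   # noise, drawn independently of Psi
z = mu + sigma * eps                 # z = z_Psi(eps), deterministic in Psi

# Chain rule through the sample: dz/dmu = 1 and dz/dsigma = eps.
grad_mu = np.mean(f_prime(z))
grad_sigma = np.mean(f_prime(z) * eps)

# Closed forms for this toy case, from E f(z) = -((mu - 1)^2 + sigma^2).
exact_mu = -2.0 * (mu - 1.0)
exact_sigma = -2.0 * sigma
```

The Monte Carlo estimates `grad_mu` and `grad_sigma` should land close to `exact_mu` and `exact_sigma`; in a real VAE an automatic differentiation library performs the same chain rule through the encoder network.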

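13 The Reparameterization Trick

$$\Psi^* = \operatorname*{argmax}_{\Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\epsilon \sim \mathrm{noise}} \ln Q_\Phi(\hat z_\Psi(y, \epsilon), y) + H(P_\Psi(\hat z \mid y)) \right]$$

$$H(P_\Psi(\hat z \mid y)) = -E_{\epsilon \sim \mathrm{noise}} \ln P_\Psi(\hat z_\Psi(y, \epsilon) \mid y)$$

For VAEs we typically have $\hat z(y, \epsilon) \in \mathbb{R}^d$ with

$$\hat z(y, \epsilon)[i] = \mu_\Psi(y)[i] + \sigma_\Psi(y)[i] \, \epsilon[i], \qquad \epsilon[i] \sim \mathcal{N}(0, 1)$$

This supports easy calculation of $P_\Psi(\hat z_\Psi(y, \epsilon) \mid y)$.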
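
The "easy calculation" can be made concrete for the diagonal Gaussian case. In the sketch below, the vectors `mu` and `sigma` are hypothetical stand-ins for the encoder outputs μ_Ψ(y) and σ_Ψ(y); no encoder network is modeled. The log-density of a sample is written directly in terms of ε, and the Monte Carlo entropy is compared with the standard closed form for a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_and_log_prob(mu, sigma, eps):
    """Return z = mu + sigma * eps and ln P(z | y) for a diagonal Gaussian."""
    z = mu + sigma * eps
    # (z - mu) / sigma equals eps, so the density is a function of eps and sigma.
    log_p = np.sum(-0.5 * eps**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=-1)
    return z, log_p


mu = np.array([0.3, -1.0, 2.0])      # stand-ins for mu_Psi(y)
sigma = np.array([0.5, 1.5, 0.1])    # stand-ins for sigma_Psi(y)
eps = rng.standard_normal((200_000, 3))

z, log_p = sample_and_log_prob(mu, sigma, eps)

# H(P) = -E ln P(z | y), estimated by sampling and computed exactly.
entropy_mc = -np.mean(log_p)
entropy_exact = np.sum(np.log(sigma)) + 0.5 * len(mu) * (1 + np.log(2 * np.pi))
```

Because the Gaussian entropy is available in closed form, implementations commonly use the exact expression rather than the sampled one.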

14 Decoding with L2 Distortion

An autoencoder encodes and decodes.

We can view $\hat z_\Psi(y, \epsilon)$ as the encoding of $y$.

We now consider a deterministic decoder $\hat y_\Phi(\hat z)$ and define a model

$$Q_\Phi(y \mid \hat z) \propto \exp\left( -\frac{\| y - \hat y_\Phi(\hat z) \|^2}{2\sigma^2} \right)$$

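15 A VAE for Images

Auto-Encoding Variational Bayes, Diederik P Kingma, Max Welling

[Figure: the encoder maps $y$ to $\hat z_\Psi(y, \epsilon)$, the decoder maps $\hat z$ to $\hat y_\Phi(\hat z)$, and the distortion is $\| y - \hat y \|^2$. Image credit: Hyeonwoo Noh et al.]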

16 Deconvolution: Increasing Spatial Dimension

Consider a stride 2 convolution

$$y[i, j, c_y] = \sum_{\Delta i, \Delta j, c_x} W[\Delta i, \Delta j, c_x, c_y] \, x[2i + \Delta i, 2j + \Delta j, c_x]$$

$$y[i, j, c_y] \mathrel{+}= B[c_y]$$

For deconvolution we use stride 1 with 4 times the channels.

$$\hat x[i, j, c_{\hat x}] = \sum_{\Delta i, \Delta j, c_{\hat y}} W[\Delta i, \Delta j, c_{\hat y}, c_{\hat x}] \, \hat y[i + \Delta i, j + \Delta j, c_{\hat y}]$$

$$\hat x[i, j, c_{\hat x}] \mathrel{+}= B[c_{\hat x}]$$

The channels at each lower resolution pixel $\hat x[i, j]$ are divided among four higher resolution pixels.

This is done by a simple reshaping of $\hat x$.
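
The reshaping step can be sketched in a few lines of NumPy. The function name `depth_to_space` and the particular way channels are assigned to the four pixels of each 2x2 block are assumptions for illustration; the slide does not fix a channel ordering.

```python
import numpy as np


def depth_to_space(x_hat):
    """Reshape (H, W, 4*C) into (2*H, 2*W, C).

    Channel group 2*a + b of pixel (i, j) becomes pixel (2*i + a, 2*j + b).
    """
    h, w, c4 = x_hat.shape
    c = c4 // 4
    blocks = x_hat.reshape(h, w, 2, 2, c)        # split channels into a 2x2 block
    blocks = blocks.transpose(0, 2, 1, 3, 4)     # reorder axes to (h, 2, w, 2, c)
    return blocks.reshape(2 * h, 2 * w, c)


# Example: H = 2, W = 3, and 4*C = 8 channels at the lower resolution.
x_hat = np.arange(2 * 3 * 8, dtype=float).reshape(2, 3, 8)
out = depth_to_space(x_hat)
```

In this example `out` has shape (4, 6, 2): the spatial resolution doubles in each direction while the channel count drops by a factor of four, so no values are created or lost.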

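17 Decoding with L2 Distortion

$$\Phi^*, \Psi^* = \operatorname*{argmax}_{\Phi, \Psi} \; E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \ln Q_\Phi(\hat z, y) + H(P_\Psi(\hat z \mid y)) \right]$$

Writing $Q_\Phi(\hat z, y) = P_\Phi(\hat z) \, Q_\Phi(y \mid \hat z)$ with the L2 decoder model, the objective now becomes (up to an additive constant)

$$E_{y \sim \mathrm{Pop}} \left[ E_{\hat z \sim P_\Psi(\hat z \mid y)} \left( \ln P_\Phi(\hat z) - \frac{1}{2\sigma^2} \| y - \hat y_\Phi(\hat z) \|^2 \right) + H(P_\Psi(\hat z \mid y)) \right]$$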
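
A sampled estimate of this objective for a single y might look like the sketch below. Several pieces are assumptions not stated in the slides: the prior P_Φ(ẑ) is taken to be a standard normal, the decoder is a hypothetical linear map `W @ z + b`, and `mu` and `sigma` stand in for the encoder outputs. As on the slide, the normalizing constant of Q_Φ(y | ẑ) is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_y = 2, 4
W = rng.normal(size=(d_y, d_z))      # hypothetical linear decoder: y_hat = W z + b
b = rng.normal(size=d_y)
sigma_dec = 1.0                      # the sigma appearing in 1 / (2 sigma^2)


def objective_estimate(y, mu, sigma, eps):
    """Estimate E[ln P(z) - |y - y_hat(z)|^2 / (2 sigma_dec^2)] + H(P(z | y))."""
    z = mu + sigma * eps                                      # reparameterised samples
    log_prior = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=-1)  # N(0, I) prior
    y_hat = z @ W.T + b
    distortion = np.sum((y - y_hat) ** 2, axis=-1)
    # Entropy of the diagonal Gaussian encoder, in closed form.
    entropy = np.sum(np.log(sigma)) + 0.5 * len(mu) * (1 + np.log(2 * np.pi))
    return np.mean(log_prior - distortion / (2 * sigma_dec**2)) + entropy


y = rng.normal(size=d_y)             # one hypothetical data point
mu = np.array([0.4, -0.7])           # stand-ins for mu_Psi(y)
sigma = np.array([0.6, 1.2])         # stand-ins for sigma_Psi(y)
eps = rng.standard_normal((100_000, d_z))
value = objective_estimate(y, mu, sigma, eps)
```

Training would average this quantity over a batch of y and ascend its gradient with respect to both the decoder and encoder parameters; that step needs automatic differentiation and is omitted here.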

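18 Decoding with L2 Distortion

Switching back to minimization, we can now rewrite the objective as

$$\min \; E_{y, \epsilon} \left( |\hat z_\Psi(y, \epsilon)|_\Phi + \lambda \| y - \hat y_\Phi(\hat z_\Psi(y, \epsilon)) \|^2 - |\hat z_\Psi(y, \epsilon)|_{\Psi, y} \right)$$

$$|\hat z|_\Phi = -\log_2 P_\Phi(\hat z), \qquad |\hat z|_{\Psi, y} = -\log_2 P_\Psi(\hat z \mid y)$$

For $\hat z$ discrete, $|\hat z|_\Phi$ is the code length of $\hat z(y, \epsilon)$ under an optimal code for $P_\Phi$.

$|\hat z|_{\Psi, y}$ is the code length for $\hat z$ under the code for $P_\Psi(\hat z \mid y)$.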
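
A small worked example of the code-length reading, for a discrete ẑ with three values. The prior `p_phi`, the weight `lam`, and the vectors passed to `rd_loss` are made-up numbers for illustration. The function computes only the rate and distortion terms; for a deterministic encoder the third term, the code length under P_Ψ(ẑ | y), is zero.

```python
import numpy as np

p_phi = np.array([0.5, 0.25, 0.25])                # hypothetical prior P_Phi over three codes
code_length_bits = -np.log2(p_phi)                 # |z|_Phi for each z: [1, 2, 2]
expected_rate = np.sum(p_phi * code_length_bits)   # 1.5 bits, the entropy of P_Phi


def rd_loss(z, y, y_hat, lam):
    """Rate in bits plus lam times squared-error distortion for one code z."""
    return code_length_bits[z] + lam * np.sum((y - y_hat) ** 2)


# Code z = 1 costs 2 bits; the squared error is 0.25, weighted by lam = 4.
loss = rd_loss(z=1, y=np.array([1.0, 2.0]), y_hat=np.array([1.0, 1.5]), lam=4.0)  # 3.0
```

Frequent codes are cheap and rare codes are expensive, so minimizing the rate term pushes the encoder toward codes the prior considers likely, while λ sets how much distortion is worth one extra bit.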

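19 Soft EM is to Hard EM as VAE is to Rate-Distortion

(Soft) EM:

$$\Phi^{t+1} = \operatorname*{argmax}_{\Phi} \; E_{y \sim \mathrm{Pop}} \, E_{\hat z \sim Q_{\Phi^t}(\hat z \mid y)} \log Q_\Phi(\hat z, y)$$

Hard EM:

$$\Phi^{t+1} = \operatorname*{argmax}_{\Phi} \; E_{y \sim \mathrm{Pop}} \log Q_\Phi(\hat z(y), y), \qquad \hat z(y) = \operatorname*{argmax}_{\hat z} Q_{\Phi^t}(\hat z \mid y)$$

VAE:

$$\min \; E_{y, \epsilon} \left( |\hat z_\Psi(y, \epsilon)|_\Phi + \lambda \| y - \hat y_\Phi(\hat z_\Psi(y, \epsilon)) \|^2 - |\hat z_\Psi(y, \epsilon)|_{\Psi, y} \right)$$

RD:

$$\min \; E_{y} \left( |\hat z_\Psi(y)|_\Phi + \lambda \| y - \hat y_\Phi(\hat z_\Psi(y)) \|^2 \right)$$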

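20 Sampling

[Figure: the encoder $P_\Psi(\hat z \mid y)$ produces $\hat z$, which feeds the second half $Q_\Phi(\hat z, y)$. Image credit: Hyeonwoo Noh et al.]

Sampling uses just the second half $Q_\Phi(\hat z, y)$.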

21 Sampling from Gaussian Variational Autoencoders

[Figure: samples from a Gaussian variational autoencoder. Image credit: Alec Radford]

22 Why Blurry?

A common explanation for the blurriness of images generated from VAEs is the use of L2 as the distortion measure.

It does seem that L1 works better (see the slides on image-to-image GANs).

However, training on L2 distortion can produce sharp images in rate-distortion autoencoders (see the slides on rate-distortion autoencoders).

23 END
