Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference

Size: px

Start display at page:

Download "Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference"

Dale McDowell
5 years ago
Views:

1 Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference Louis C. Tiao 1 Edwin V. Bonilla 2 Fabio Ramos 1 July 22, University of Sydney, 2 University of New South Wales

2 Motivation: Unpaired Image-to-Image Translation Monet Photos Zebras Horses Summer Monet photo zebra horse summer photo Monet horse zebra winter Photograph Monet Van Gogh Paired n xi n o summer Ukiyo-e Unpaired X Y, o winter ( )( ) n,,, yi o Cezanne Winter Figure 1: From Zhu et al. (2017) 1

3 Cycle-Consistent Adversarial Learning (CycleGAN) Introduced by Kim et al. (2017); Zhu et al. (2017) Forward and reverse mappings m ϕ : x z and µ θ : z x Discriminators D α and D β Distribution matching (gan objectives) Yield realistic outputs in the other domain. l reverse gan (α; ϕ) = E p (z)[log D α (z)] + E q (x)[log(1 D α (m ϕ (x)))], lgan forward (β; θ) = E p (x)[log D β (x)] + E p (z)[log(1 D β (µ θ (z)))]. Cycle-consistency losses Encourage tighter correspondences must be able to reconstruct output from input and vice versa. May alleviate mode-collapse l reverse const (θ, ϕ) = E q (x)[ x µ θ (m ϕ (x)) ρ ρ], l forward const (θ, ϕ) = E p (z)[ z m ϕ (µ θ (z)) ρ ρ]. 2

4 Contributions We cast the problem of learning inter-domain correspondences without paired data as approximate Bayesian inference in a latent variable model (lvm). 1. We introduce implicit latent variable models (ilvms), prior over latent variables specified flexibly as implicit distribution. 2. We develop a new variational inference (vi) algorithm based on minimizing the symmetric Kullback-Leibler (kl) divergence between a variational and exact joint distribution. 3. We demonstrate that cyclegan (Kim et al., 2017; Zhu et al., 2017) can be instantiated as a special case of our framework. 3

5 Implicit Latent Variable Models Join Distribution p θ (x, z) = p θ (x z) p (z) }{{}}{{} likelihood prior z n x n N θ Prescribed Likelihood Likelihood p θ (x n z n ) is prescribed (as usual) Implicit Prior Prior p (z) over latent variables specified as implicit distribution Given only by a finite collection Z = {z m} M m=1 of its samples, z m p (z) Offers utmost degree of flexibility in treatment of prior information. 4

Implicit Latent Variable Models: Example Unpaired

Empirical data distribution q (x) specified by

6 Implicit Latent Variable Models: Example Unpaired Image-to-Image Translation Prior distribution p (z) specified by images Z = {z m} M m=1 from one domain. Empirical data distribution q (x) specified by images X = {x n } N n=1 from another domain. (a) samples from p (z) (b) a sample from q (x) 5

7 Inference in Implicit Latent Variable Models Having specified the generative model, our aims are Optimize θ by maximizing marginal likelihood p θ (x) Infer hidden representations z by computing posterior p θ (z x) Both require intractable p θ (x) must resort to approximate inference Classical Variational Inference Approximate exact posterior p θ (z x) with variational posterior q ϕ (z x) Reduces inference problem to optimization problem min kl [q ϕ(z x) p θ (z x)] ϕ 6

8 Symmetric Joint-Matching Variational Inference

9 Joint-Matching Variational Inference Variational Joint Consider instead directly approximating the exact joint with variational joint q ϕ (x, z) = q ϕ (z x)q (x) variational posterior q ϕ (z x) also prescribed ϕ z n θ x n N 7

10 Symmetric Joint-Matching Variational Inference Minimize symmetric kl divergence between joints kl symm [p θ (x, z) q ϕ (x, z)] where kl symm [p q] = kl [p q] + kl [q p] }{{}}{{} forward kl reverse kl Why? 1. Because we can: kl symm [p θ (x, z) q ϕ (x, z)] tractable kl symm [p θ (z x) q ϕ (z x)] intractable 2. Helps avoid under/over-dispersed approximations (see paper for details) 8

11 Reverse kl Variational Objective Minimizing reverse kl divergence between joints equivalent to maximizing usual evidence lower bound (elbo), kl [q ϕ (x, z) p θ (x, z)] = E qϕ (x,z) [log q ϕ (x, z) log p θ (x, z)] Recall (negative) elbo, = E qϕ (x,z) [log q ϕ (z x) log p θ (x, z)] H[q (x)] }{{}}{{} L nelbo(θ,ϕ) constant L nelbo (θ, ϕ) = E q (x)q ϕ (z x)[ log p θ (x z)] + E q (x)kl [q ϕ (z x) p (z)] }{{}}{{} L nell(θ,ϕ) intractable kl term is intractable as prior p (z) is unavailable can only sample! 9

12 Forward kl Variational Objective Minimizing forward kl divergence between joints kl [p θ (x, z) q ϕ (x, z)] = E pθ (x,z) [log p θ (x, z) log q ϕ (x, z)] = E pθ (x,z) [log p θ (x z) log q ϕ (x, z)] H[p (z)] }{{}}{{} L naplbo(θ,ϕ) constant New variational objective, aggregate posterior lower bound (aplbo) L naplbo (θ, ϕ) = E p (z)p θ (x z)[ log q ϕ (z x)] + E p (z)kl [p θ (x z) q (x)] }{{}}{{} L nelp(θ,ϕ) intractable kl term is intractable as empirical data distribution q (x) is unavailable can only sample! 10

13 Density Ratio Estimation and f-divergence Approximation General f-divergence lower bound (Nguyen et al., 2010) For convex lower-semicontinuous function f : R + R, where E q (x)d f [p (z) q ϕ (z x)] }{{} intractable max α Llatent f (α; ϕ), }{{} tractable L latent f (α; ϕ) = E q (x)q ϕ (z x)[f (r α (z; x))] E q (x)p (z)[f (f (r α (z; x)))] Turns divergence estimation into an optimization problem Estimate divergence using a l.b. that just requires samples! r α is a neural net with parameters α, with equality at r α(z; x) = q ϕ(z x) p (z) 11

14 kl divergence lower bound Example: kl divergence lower bound For f(u) = u log u, we instantiate the kl lower bound where E q (x)kl [q ϕ (z x) p (z)] }{{} intractable max α Llatent kl (α; ϕ) } {{ } tractable L latent kl (α; ϕ) = E q (x)q ϕ (z x)[log r α (z; x)] E q (x)p (z)[r α (z; x) 1] Yields estimate of the elbo where all terms are tractable, L nelbo (θ, ϕ) = L nell (θ, ϕ) + E }{{} q (x)kl [q ϕ (z x) p (z)] }{{} tractable intractable max L nell(θ, ϕ) + L latent kl (α; ϕ) α }{{}}{{} tractable tractable 12

15 CycleGAN as a Special Case

16 Cycle-consistency as Conditional Probability Maximization For Gaussian likelihood and variational posterior p θ (x z) = N (x µ θ (z), τ 2 I), q ϕ (z x) = N (z m ϕ (x), t 2 I) Can instantiate l reverse const (θ, ϕ) from L nell (θ, ϕ) as posterior q ϕ (z x) degenerates (as t 0) Can instantiate l forward const (θ, ϕ) from L nelp (θ, ϕ) as likelihood p θ (x z) degenerates (as τ 0) Cycle-consistency corresponds to maximizing conditional probabilities: ell. forces q ϕ (z x) to place mass on hidden representations that recover the data elp. forces p θ (x z) to generate observations that recover the prior 13

17 Distribution Matching as Regularization For appropriate setting of f, and simplifying the mappings and discriminators, Can instantiate l reverse gan (α; ϕ) from L latent f (α; ϕ) Can instantiate l forward gan (β; θ) from L observed f (β; θ) Approximately minimizes intractable divergences: D f [p (z) q ϕ (z x)] forces q ϕ (z x) to match prior p (z) D f [q (x) p θ (x z)] forces p θ (x z) to match data q (x) Summary L nelbo (θ, ϕ) max L nell(θ, ϕ) α }{{} l reverse const (θ,ϕ) l forward const (θ,ϕ) latent + Lkl (α; ϕ) }{{} l reverse gan (α;ϕ) L naplbo (θ, ϕ) max L nelp(θ, ϕ) + L observed kl (β; θ) β }{{}}{{} l forward gan (β;θ) 14

18 Conclusion Formulated implicit latent variable models, which introduces implicit prior over latent variables Offers utmost degree of flexibility in incorporating prior knowledge Developed new paradigm for variational inference directly approximates exact joint distribution minimizes the symmetric kl divergence Provided theoretical treatment of the links between CycleGAN methods and Variational Bayes Poster Session To find out more, come visit us at our poster! Poster #14, Session 4 (17:10-18:00 Saturday, 14 July) 15

19 Questions? 15

20 References i References Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. (2017). Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. IEEE Trans. Information Theory, 56(11): Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV). 16

21 Symmetric Joint-Matching kl Minimization i kl divergence is asymmetric kl [p q] kl [q p] kl [q ϕ (z x) p θ (z x)] (reverse) underestimates support kl [p θ (z x) q ϕ (z x)] (forward) overestimates support Consider symmetric kl: kl symm [p q] = kl [p q] + kl [q p] Forward kl involves expectation under intractable posterior p θ (z x) what we re trying to approximate in the first place [ kl [p θ (z x) q ϕ (z x)] = E pθ (z x) log p ] θ(z x) q ϕ (z x)

22 Symmetric Joint-Matching kl Minimization ii Can show arg min ϕ arg min ϕ Already showed kl [q ϕ (z x) p θ (z x)] = arg min kl [q ϕ (x, z) p θ (x, z)] ϕ kl [p θ (z x) q ϕ (z x)] = arg min kl [p θ (x, z) q ϕ (x, z)] ϕ arg max ϕ L elbo (θ, ϕ) = arg min kl [q ϕ (x, z) p θ (x, z)] ϕ Can we find something similar for kl [p θ (x, z) q ϕ (x, z)]?

Lecture 14: Deep Generative Learning

Generative Modeling CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 14: Deep Generative Learning Density estimation Reconstructing probability density function using samples Bohyung Han