Lecture 14: Deep Generative Learning

Generative Modeling CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 14: Deep Generative Learning Density estimation Reconstructing probability density function using samples Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr Sample generation 2 Training examples Generative samples Generative Modeling by Neural Networks Variational Auto Encoder (VAE) Neural networks with continuous latent variables Encoder: approximates a posterior distribution Decoder: stochastically reconstructs the data from the latent variables Generative Adversarial Networks (GANs) Generator: generates samples using a uniform distribution Discriminator: discriminates between real and generated images Deep Boltzmann Machines (DBMs) Deep Belief Networks (DBNs) Why Generative Modeling? Limitations of discriminative neural networks Requirement of large-scale labeled training data Unreasonable mistakes for trivial data variations Poor local optima Generative modeling Adjust model parameters to maximize the probability that the data would have been produced 3 4

Autoencoder Traditional autoencoders Feature learning Learning with the reconstruction error Basic pipeline Autoencoder Traditional autoencoders Feature learning Learning with the reconstruction error Supervised feature learning Input Features Reconstructed input Input Features Encoder! " Decoder! Encoder! " Classifier '!: image ": latent variable! : image!: image ": latent variable ': class label L =!! Fine-tuning by backpropagation 5 6 Variational Autoencoder Characteristics A Bayesian approach on an autoencoder: sample generation Estimating model parameter ( without access to latent variable " Problem scenario true prior +, ()) true conditional +, (0 )) Variational Autoencoder Characteristics A Bayesian approach on an autoencoder: sample generation Estimating model parameter ( without access to latent variable " Limitation true prior +, ()) true conditional +, (0 )) ) 0 ) 0 ): classes, attributes 0: image ): classes, attributes 0: image A value ) * is generated from a prior distribution +, ()). A value 0 * is generated from a conditional distribution +, (0 )). It is difficult to estimate +, ()) and +, (0 )) in practice. We need to approximate these distributions. True parameters 2 and the values of the latent variables ) 3 are hidden. 7 8

Variational Autoencoder Characteristics A Bayesian approach on an autoencoder: sample generation Estimating model parameter ( without access to latent variable " Main idea +, ()) +, (0 )) Decoder ) 0 Variational Autoencoder Characteristics A Bayesian approach on an autoencoder: sample generation Estimating model parameter ( without access to latent variable " Main idea 4 5 () 0) Encoder 0 ) Decoder +, (0 )) 0 ): classes, attributes 0: image 0: image ): classes, attributes 0 : image Assume the prior +, ()) is a unit Gaussian. Assume the likelihood +, (0 )) is a diagonal Gaussian. Decoder estimates mean and variance of +, (0 )). Encoder estimates mean and variance of 4 5 () 0). Decoder estimates mean and variance of +, (0 )). 9 10 Optimization Optimization Variational lower bound of marginal likelihood log +, 0 * = 9 :; ) 0 log +, 0 * = < => 4 5 ) 0 * +, ) 0 * + L (, A; 0 * Maximization of variational lower bound L (, A; 0 = 9 :; ) 0 log 4 5 ) 0 + log +, 0, ) = 9 :; ) 0 log 4 5 ) 0 + log + ) + log +, 0 ) KL divergence of the approximate from the true posterior Variational lower bound on the marginal likelihood = < => 4 5 ) 0 +, ) + 9 :; ) 0 log +, 0 ) Regularization term Reconstruction term log +, 0 * L (, A; 0 * Maximize the variational lower bound of the marginal likelihood. Training Reconstruct term: MSE between input and generated images Regularization term: optimization by reparametrization trick using mean and standard deviation 11 12

Variational Autoencoder Characteristics A Bayesian approach on an autoencoder: generate samples Estimating model parameter ( without access to latent variable " Inference +, ()) ) Decoder +, (0 )) 0 Learned Manifold Visualizations of learned manifold for generative models with 2D latent space ): classes, attributes 0: image Decoder samples mean and variance from +, ()). Decoder samples 0 from the mean and variance. 13 14 Generative Adversarial Networks Optimization 15 Two models A generative model D that captures the data distribution A discriminative model < that estimates the probability that a sample came from the training data rather than D Objective Maximize the probability of < making a mistake Trained with backpropagation Generator ) 0 Random noise Generated image Discriminator Train generator and 0 discriminator jointly! Real image [Goodfellow14] I. J. Goodfellow, et al.: Generative Adversarial Nets, NIPS 2014 Real or fake? 16 Objective O <, D 9 0STUVU 0 log < 0; ( G + 9 )S) ) log 1 < D ); ( E ; ( G min K max O <, D N < and D play the two-player minimax game with value function O <, D Simultaneously training D to minimize log 1 < D ); ( E ; ( G + ) ) : prior on input noise variables D ); ( E Representing a mapping from ) to data space A differentiable function represented by a multilayer perceptron with parameters ( E < 0; ( G Representing the probability that 0 came from data rather than generator distribution + E [Goodfellow14] I. J. Goodfellow, et al.: Generative Adversarial Nets, NIPS 2014

Saddle point estimation Optimization O <, D 9 0STUVU 0 log < 0; ( G + 9 )S) ) log 1 < D ); ( E ; ( G min K max O <, D N Optimization Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, k, is a hyperparameter. We used k =1, the least expensive option, in our experiments. for number of training iterations do for k steps do Sample minibatch of m noise samples {z (1),...,z (m) } from noise prior p g (z). Sample minibatch of m examples {x (1),...,x (m) } from data generating distribution p data (x). Update the discriminator by ascending its stochastic gradient: r d 1 m mx i=1 h log D x (i) + log 1 D G z (i) i. end for Sample minibatch of m noise samples {z (1),...,z (m) } from noise prior p g (z). Update the generator by descending its stochastic gradient: r g 1 m mx i=1 log 1 D G z (i). end for The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments. 17 [Goodfellow14] I. J. Goodfellow, et al.: Generative Adversarial Nets, NIPS 2014 18 [Goodfellow14] I. J. Goodfellow, et al.: Generative Adversarial Nets, NIPS 2014 Generated Examples Laplacian Generative Adversarial Networks Generating high-resolution images progressively WX Y = Z WX Y[\ + h^y = Z WX Y[\ + D Y " Y, Z WX Y[\ Residual image Upsampling Generative model Nearest neighbors from training set I 0 64x64 h 0 G 0 z 0 l 0 I 1 h 1 G 1 z 1 l 1 I 2 l 2 I 3 h 2 G 2 G 3 z 2 z 3 19 [Goodfellow14] I. J. Goodfellow, et al.: Generative Adversarial Nets, NIPS 2014 [Denton15] E. Denton, S. Chintalal, A. Szlam, R. Fergus: Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, NIPS 2015 20

Laplacian Generative Adversarial Networks Results Training: the opposite direction Model at each level is trained independently. Blur and downsample Upsample I 2 I 2 I 3 I 3 I = I 0 I 1 I 1 z 0 z 1 z 2 I 3 l 2 D 3 Real/ Generated? h 0 l 0 h 0 G 0 h 1 l 1 D 1 h 1 G 1 Real/Generated? D 2 G 2 G 3 h 2 h 2 z 3 Real/ Generated? CC-LAPGAN: Truck LAPGAN GAN [14] D 0 Real/Generated? [Denton15] E. Denton, S. Chintalal, A. Szlam, R. Fergus: Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, NIPS 2015 21 Nearest neighbors from training set [Denton15] E. Denton, S. Chintalal, A. Szlam, R. Fergus: Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, NIPS 2015 22 DCGAN Representation learning by GAN Unsupervised representation learning Learning from scratch First CNN-based GAN model with good performance Apply to classification Combine discriminator and any classifier for prediction Good performance even without fine tuning Generator Discriminator DCGAN Architecture guidelines for stable DCGANs Replace pooling layers with strided convolutions (in discriminator) and fractional-strided convolutions (in generator). Use batch normalization in both the generator and the discriminator. Remove fully connected hidden layers for deeper architectures. Use ReLU in generator for all layers except for the output, which uses tanh. Use LeakyReLU activation in the discriminator for all layers. Real or fake? [Radford16] A. Radford, L. Metz, S. Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 23 [Radford16] A. Radford, L. Metz, S. Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 24

Results Interpolation in Latent Space On CIFAR-10 using trained model with ImageNet-1K Model 1 Layer K-means 3 Layer K-means Learned RF View Invariant K-means Exemplar CNN DCGAN (ours) + L2-SVM Accuracy 80.6% 82.0% 81.9% 84.3% 82.8% Accuracy (400 per class) 63.7% (±0.7%) 70.7% (±0.7%) 72.6% (±0.7%) 77.4% (±0.2%) 73.8% (±0.4%) max # of features units 4800 3200 6400 1024 512 On SVHN with 1000 labels Model KNN TSVM M1+KNN M1+TSVM M1+M2 SWWAE without dropout SWWAE with dropout DCGAN (ours) + L2-SVM Supervised CNN with the same architecture error rate 77.93% 66.55% 65.63% 54.33% 36.02% 27.83% 23.56% 22.48% 28.87% (validation) [Radford16] A. Radford, L. Metz, S. Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 [Radford16] A. Radford, L. Metz, S. Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 Vector Arithmetic for Visual Concepts CycleGAN 25 26 Image-to-image translation Converting an image from one representation to another e.g., gray to color, image to semantic labels, edge map to photograph Generative adversarial network Cycle consistency Using transitivity as a way to regularize structured data [Radford16] A. Radford, L. Metz, S. Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 27 Monet photo zebra horse summer photo Monet horse zebra winter winter summer [Zhu17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017 28

Unpaired Translation There is some underlying relationship between domains. Seeking the relationship without explicit mapping Learning to translate between domains without input-output pairs Using supervision in the level of sets Paired { x i y i },, { } { }, Unpaired { X } { Y } [Zhu17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017 29, Use bidirectional GAN and Introduce cycle consistency loss! min K,j cycle-consistency loss max L D, f, < g, < b N k,n l X { Formulation by GAN L _`a D, < b, c, d + L _`a f, < g, d, c + L hih D, f L _`a D, < b, c, d = 9 estuvu e log < b e + 9 0STUVU 0 log 1 < b D 0 L _`a f, < g, d, c = 9 0STUVU 0 log < g 0 + 9 estuvu e log 1 < g D e Y D X X G F [Zhu17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017 30 D Y Y X { Y cycle-consistency loss L hih D, f = 9 0STUVU 0 f D 0 0 \ + 9 estuvu e D f e e \ Results of CycleGAN Wasserstein GAN Input Cycle alone GAN alone GAN+forward GAN+backward CycleGAN (ours) Ground truth Original GAN Discriminator minimize Jensen-Shannon (JS) divergence between the real data distribution and the generated distribution mn + o, +, = 1 2 qr(+ o + s + 1 2 qr(+, + s winter Yosemite summer Yosemite = 1 2 qr + \ + o + +, 2 + 1 2 qr +, + o + +, 2 + o : real data distribution, +, : generated distribution Not well suited for optimization: ill-defined gradients in all situations Wasserstein GAN Discriminator minimizes Earth Mover s distance (EMD) t + o, +, = inf 9 {, v! ' v x S y,s z summer Yosemite winter Yosemite [Zhu17] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV 2017 31 Π + o, +, : a set of joint distribution of + o and +, Gradients are much more well-defined. [Arjovsky17] M. Arjovsky, S. Chintala, L. Bottou: Wasserstein Generative Adversarial Networks. ICML 2017 32

Wasserstein GAN EMD Optimal transport cost from one distribution to another Jensen-Shannon divergence vs earth mover s distance 1-Lipschitz constraint Wasserstein GAN t + o, +, = inf v x S y,s z 9 {, v! ' t + o, +, ã-lipschitz constraint = sup Å ÇÉÑ 9 {Sy Ö! 9 {Sz Ö! Kantorovich-Rubinstein Duality min, max Ü áy,y à 9 {S y Ö Ü! 9 {Sz Ö Ü (â, " ) Generator 0 Random noise Fake image Critic network Ö Ü Ö Ü! or Ö Ü â, " 0 ä JS EMD [Arjovsky17] M. Arjovsky, S. Chintala, L. Bottou: Wasserstein Generative Adversarial Networks. ICML 2017 33 Real image [Arjovsky17] M. Arjovsky, S. Chintala, L. Bottou: Wasserstein Generative Adversarial Networks. ICML 2017 34 WGAN VAE vs. GAN WGAN vs. GAN Generator: MLP with 4 hidden layers Discriminator: DCGAN Variational Autoencoder (VAE) Generative Adversarial Network (GAN) [Arjovsky17] M. Arjovsky, S. Chintala, L. Bottou: Wasserstein Generative Adversarial Networks. ICML 2017 35 36

VAE vs. GAN VAE GAN Pros Stable training Clear evaluation metrics High quality generation Less overfitting problem Cons Crude generation quality Overfitting Slow training Mode collapsing Unstable training No good evaluation metrics 37 38