
A ADVERSARIAL DOMAIN ADAPTATION (ADA)

ADA aims to transfer prediction knowledge learned from a source domain with labeled data to a target domain without labels, by learning domain-invariant features. Let $D_\phi(x) = q_\phi(y|x)$ be the domain discriminator. The conventional formulation of ADA is as follows:
\begin{align}
\max_\phi \mathcal{L}_\phi &= \mathbb{E}_{x=G_\theta(z),\, z\sim p(z|y=1)}\big[\log D_\phi(x)\big] + \mathbb{E}_{x=G_\theta(z),\, z\sim p(z|y=0)}\big[\log(1-D_\phi(x))\big], \nonumber\\
\max_\theta \mathcal{L}_\theta &= \mathbb{E}_{x=G_\theta(z),\, z\sim p(z|y=1)}\big[\log(1-D_\phi(x))\big] + \mathbb{E}_{x=G_\theta(z),\, z\sim p(z|y=0)}\big[\log D_\phi(x)\big]. \tag{18}
\end{align}
We further add the supervision objective of predicting the label $t(z)$ of data $z$ in the source domain, with a classifier $f_\omega(t|x)$ parameterized with $\omega$:
\begin{align}
\max_{\omega,\theta} \mathcal{L}_{\omega,\theta} = \mathbb{E}_{z\sim p(z|y=1)}\big[\log f_\omega(t(z)|G_\theta(z))\big]. \tag{19}
\end{align}
We then obtain the conventional formulation of adversarial domain adaptation used in, or similar to, (Ganin et al., 2016; Purushotham et al., 2017). (A schematic code sketch of these objectives is given below, after the proof of Lemma 1.)

B PROOF OF LEMMA 1

Proof. We have
\begin{align}
\mathbb{E}_{p_\theta(x|y)p(y)}\big[\log q^r(y|x)\big] = -\mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,q^r(x|y)\big) - \mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big], \tag{20}
\end{align}
where
\begin{align}
\mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big]
= p(y=0)\,\mathrm{KL}\Big(p_\theta(x|y=0)\,\Big\|\,\tfrac{p_{\theta_0}(x|y=0)+p_{\theta_0}(x|y=1)}{2}\Big)
+ p(y=1)\,\mathrm{KL}\Big(p_\theta(x|y=1)\,\Big\|\,\tfrac{p_{\theta_0}(x|y=0)+p_{\theta_0}(x|y=1)}{2}\Big). \tag{21}
\end{align}
Note that $p_\theta(x|y=0) = p_{g_\theta}(x)$ and $p_\theta(x|y=1) = p_{\mathrm{data}}(x)$. Let $p_{M_{\theta_0}} = \frac{p_{g_{\theta_0}} + p_{\mathrm{data}}}{2}$. Eq.(21) can then be simplified as
\begin{align}
\mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big] = \tfrac{1}{2}\mathrm{KL}\big(p_{g_\theta}\,\|\,p_{M_{\theta_0}}\big) + \tfrac{1}{2}\mathrm{KL}\big(p_{\mathrm{data}}\,\|\,p_{M_{\theta_0}}\big). \tag{22}
\end{align}
On the other hand, with $p_{M_\theta} = \frac{p_{g_\theta} + p_{\mathrm{data}}}{2}$,
\begin{align}
\mathrm{JSD}\big(p_{g_\theta}\,\|\,p_{\mathrm{data}}\big)
&= \tfrac{1}{2}\mathbb{E}_{p_{g_\theta}}\Big[\log\tfrac{p_{g_\theta}}{p_{M_\theta}}\Big] + \tfrac{1}{2}\mathbb{E}_{p_{\mathrm{data}}}\Big[\log\tfrac{p_{\mathrm{data}}}{p_{M_\theta}}\Big] \nonumber\\
&= \tfrac{1}{2}\mathbb{E}_{p_{g_\theta}}\Big[\log\tfrac{p_{g_\theta}}{p_{M_{\theta_0}}}\Big] + \tfrac{1}{2}\mathbb{E}_{p_{\mathrm{data}}}\Big[\log\tfrac{p_{\mathrm{data}}}{p_{M_{\theta_0}}}\Big] + \tfrac{1}{2}\mathbb{E}_{p_{g_\theta}}\Big[\log\tfrac{p_{M_{\theta_0}}}{p_{M_\theta}}\Big] + \tfrac{1}{2}\mathbb{E}_{p_{\mathrm{data}}}\Big[\log\tfrac{p_{M_{\theta_0}}}{p_{M_\theta}}\Big] \nonumber\\
&= \tfrac{1}{2}\mathrm{KL}\big(p_{g_\theta}\,\|\,p_{M_{\theta_0}}\big) + \tfrac{1}{2}\mathrm{KL}\big(p_{\mathrm{data}}\,\|\,p_{M_{\theta_0}}\big) + \mathbb{E}_{p_{M_\theta}}\Big[\log\tfrac{p_{M_{\theta_0}}}{p_{M_\theta}}\Big] \nonumber\\
&= \tfrac{1}{2}\mathrm{KL}\big(p_{g_\theta}\,\|\,p_{M_{\theta_0}}\big) + \tfrac{1}{2}\mathrm{KL}\big(p_{\mathrm{data}}\,\|\,p_{M_{\theta_0}}\big) - \mathrm{KL}\big(p_{M_\theta}\,\|\,p_{M_{\theta_0}}\big). \tag{23}
\end{align}
Note that
\begin{align}
\nabla_\theta\, \mathrm{KL}\big(p_{M_\theta}\,\|\,p_{M_{\theta_0}}\big)\Big|_{\theta=\theta_0} = 0. \tag{24}
\end{align}
Taking derivatives of Eq.(22) w.r.t. $\theta$ at $\theta_0$, and using Eq.(23) and Eq.(24), we get
\begin{align}
\nabla_\theta\, \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big]\Big|_{\theta=\theta_0}
&= \nabla_\theta\Big(\tfrac{1}{2}\mathrm{KL}\big(p_{g_\theta}\,\|\,p_{M_{\theta_0}}\big) + \tfrac{1}{2}\mathrm{KL}\big(p_{\mathrm{data}}\,\|\,p_{M_{\theta_0}}\big)\Big)\Big|_{\theta=\theta_0} \tag{25}\\
&= \nabla_\theta\, \mathrm{JSD}\big(p_{g_\theta}\,\|\,p_{\mathrm{data}}\big)\Big|_{\theta=\theta_0}. \nonumber
\end{align}
Taking derivatives of both sides of Eq.(20) w.r.t. $\theta$ at $\theta_0$ and plugging in the last equation of Eq.(25), we obtain the desired result.
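
The ADA objectives in Eqs.(18)-(19) translate almost verbatim into code. The following is a minimal PyTorch-style sketch that we add for illustration only, not the implementation of the cited works; the module names (feature_extractor for $G_\theta$, discriminator for $D_\phi$, classifier for $f_\omega$) and the input batches are assumptions.

```python
# Hypothetical sketch of the ADA objectives in Eqs.(18)-(19).
# Assumed components (not from the paper's code): feature_extractor G_theta,
# discriminator D_phi (outputs a probability of y=1, i.e. "source"),
# classifier f_omega, plus a labeled source batch (z_src, t_src) and an
# unlabeled target batch z_tgt.
import torch
import torch.nn.functional as F

def ada_losses(feature_extractor, discriminator, classifier, z_src, t_src, z_tgt):
    x_src = feature_extractor(z_src)          # x = G_theta(z), source domain: y = 1
    x_tgt = feature_extractor(z_tgt)          # x = G_theta(z), target domain: y = 0

    d_src = discriminator(x_src).clamp(1e-6, 1 - 1e-6)
    d_tgt = discriminator(x_tgt).clamp(1e-6, 1 - 1e-6)

    # Eq.(18): the discriminator maximizes log D(x_src) + log(1 - D(x_tgt));
    # the feature extractor maximizes the flipped objective.
    loss_phi = -(torch.log(d_src) + torch.log(1 - d_tgt)).mean()
    loss_theta = -(torch.log(1 - d_src) + torch.log(d_tgt)).mean()

    # Eq.(19): supervised classification of source-domain labels from features.
    loss_cls = F.cross_entropy(classifier(x_src), t_src)
    return loss_phi, loss_theta, loss_cls
```

In practice the three losses would be minimized with separate optimizers over $\phi$, $\theta$, and $\{\omega,\theta\}$, and the features would typically be detached when updating the discriminator.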

[Figure 4: Left: Graphical model of InfoGAN. Right: Graphical model of Adversarial Autoencoder (AAE), which is obtained by swapping data x and code z in InfoGAN.]

C PROOF OF JSD UPPER BOUND IN LEMMA 1

We show that, in Lemma 1 (Eq.6), the JSD term is upper bounded by the KL term, i.e.,
\begin{align}
\mathrm{JSD}\big(p_\theta(x|y=0)\,\|\,p_\theta(x|y=1)\big) \le \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,q^r(x|y)\big)\Big]. \tag{26}
\end{align}
Proof. From Eq.(20), we have
\begin{align}
\mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big] \le \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,q^r(x|y)\big)\Big]. \tag{27}
\end{align}
From Eq.(22) and Eq.(23), we have
\begin{align}
\mathrm{JSD}\big(p_\theta(x|y=0)\,\|\,p_\theta(x|y=1)\big) \le \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p_\theta(x|y)\,\|\,p_{\theta_0}(x)\big)\Big]. \tag{28}
\end{align}
Eq.(27) and Eq.(28) lead to Eq.(26).
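
The identity in Eq.(23) and the resulting bound in Eq.(28) are easy to check numerically on small discrete distributions. The sketch below, added here purely as an illustration, verifies for arbitrary categorical $p_{g_\theta}$, $p_{\mathrm{data}}$, and $p_{g_{\theta_0}}$ that $\tfrac{1}{2}\mathrm{KL}(p_{g_\theta}\|p_{M_{\theta_0}}) + \tfrac{1}{2}\mathrm{KL}(p_{\mathrm{data}}\|p_{M_{\theta_0}}) = \mathrm{JSD}(p_{g_\theta}\|p_{\mathrm{data}}) + \mathrm{KL}(p_{M_\theta}\|p_{M_{\theta_0}})$, which also implies the JSD upper bound since the extra KL term is non-negative.

```python
# Numerical sanity check of the decomposition in Eq.(23) and the bound in Eq.(28),
# using categorical distributions on a small support.
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def random_dist(n):
    p = rng.random(n) + 1e-3
    return p / p.sum()

n = 10
p_g, p_data, p_g0 = random_dist(n), random_dist(n), random_dist(n)

m_theta = 0.5 * (p_g + p_data)      # p_{M_theta}
m_theta0 = 0.5 * (p_g0 + p_data)    # p_{M_theta_0}

lhs = 0.5 * kl(p_g, m_theta0) + 0.5 * kl(p_data, m_theta0)     # Eq.(22)
jsd = 0.5 * kl(p_g, m_theta) + 0.5 * kl(p_data, m_theta)       # JSD(p_g || p_data)
rhs = jsd + kl(m_theta, m_theta0)                               # RHS of Eq.(23)

assert np.isclose(lhs, rhs)     # the decomposition in Eq.(23) holds
assert jsd <= lhs + 1e-12       # hence the bound in Eq.(28)
print(lhs, jsd)
```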

D SCHEMATIC GRAPHICAL MODELS AND AAE/PM/CYCLEGAN

Adversarial Autoencoder (AAE) (Makhzani et al., 2015) can be obtained by swapping the code variable $z$ and the data variable $x$ of InfoGAN in the graphical model, as shown in Figure 4. To see this, we directly write down the objectives represented by the graphical model in the right panel, and show that they are precisely the original AAE objectives proposed in (Makhzani et al., 2015). We present the detailed derivations, which also serve as an example of how one can translate a graphical model representation into mathematical formulations. Readers can proceed similarly with the schematic graphical models of GANs, InfoGAN, VAEs, and many other relevant variants, and conveniently write down the respective objectives.

We stick to the notational convention in the paper that parameter $\theta$ is associated with the distribution over $x$, parameter $\eta$ with the distribution over $z$, and parameter $\phi$ with the distribution over $y$. Besides, we use $p$ to denote the distributions over $x$, and $q$ the distributions over $z$ and $y$.

From the graphical model, the inference process (dashed-line arrows) involves the implicit distribution $q_\eta(z|y)$ (where $x$ is encapsulated). As in the formulations of GANs (Eq.4 in the paper) and VAEs (Eq.13 in the paper), $y=1$ indicates the real distribution we want to approximate and $y=0$ indicates the approximate distribution with parameters to learn. So we have
\begin{align}
q_\eta(z|y) =
\begin{cases}
q_\eta(z|y=0) & y=0 \\
q(z) & y=1,
\end{cases} \tag{29}
\end{align}
where, as $z$ is the hidden code, $q(z)$ is the prior distribution over $z$,¹ and the space of $x$ is degenerated. Here $q_\eta(z|y=0)$ is the implicit distribution such that
\begin{align}
z \sim q_\eta(z|y=0) \;\Longleftrightarrow\; z = E_\eta(x),\; x \sim p_{\mathrm{data}}(x), \tag{30}
\end{align}
where $E_\eta(x)$ is a deterministic transformation parameterized with $\eta$ that maps data $x$ to code $z$. Note that, as $x$ is a visible variable, the pre-fixed distribution of $x$ is the empirical data distribution.

¹ See section 6 of the paper for the detailed discussion on prior distributions of hidden variables and empirical distributions of visible variables.

On the other hand, the generative process (solid-line arrows) involves $p_\theta(x|z,y)\, q^{(r)}_\phi(y|z)$ (here $q^{(r)}$ means we will swap between $q^r$ and $q$). As the space of $x$ is degenerated given $y=1$, $p_\theta(x|z,y)$ is fixed without parameters to learn, and $\theta$ is only associated with $y=0$.

With the above components, we maximize the log likelihood of the generative distributions, $\log p_\theta(x|z,y)\, q^{(r)}_\phi(y|z)$, conditioning on the variable $z$ inferred by $q_\eta(z|y)$. Adding the prior distributions, the objectives are then written as
\begin{align}
\max_\phi \mathcal{L}_\phi &= \mathbb{E}_{q_\eta(z|y)p(y)}\big[\log p_\theta(x|z,y)\, q_\phi(y|z)\big], \nonumber\\
\max_{\theta,\eta} \mathcal{L}_{\theta,\eta} &= \mathbb{E}_{q_\eta(z|y)p(y)}\big[\log p_\theta(x|z,y)\, q^r_\phi(y|z)\big]. \tag{31}
\end{align}
Again, the only difference between the objectives of $\phi$ and $\{\theta,\eta\}$ is the swapping between $q_\phi(y|z)$ and its reverse $q^r_\phi(y|z)$.

To make it clearer that Eq.(31) is indeed the original AAE proposed in (Makhzani et al., 2015), we transform $\mathcal{L}_\phi$ (dropping $p_\theta(x|z,y)$, which does not involve $\phi$) as
\begin{align}
\max_\phi \mathcal{L}_\phi &= \mathbb{E}_{q_\eta(z|y)p(y)}\big[\log q_\phi(y|z)\big] \nonumber\\
&= \tfrac{1}{2}\mathbb{E}_{q_\eta(z|y=0)}\big[\log q_\phi(y=0|z)\big] + \tfrac{1}{2}\mathbb{E}_{q_\eta(z|y=1)}\big[\log q_\phi(y=1|z)\big] \tag{32}\\
&= \tfrac{1}{2}\mathbb{E}_{z=E_\eta(x),\, x\sim p_{\mathrm{data}}(x)}\big[\log q_\phi(y=0|z)\big] + \tfrac{1}{2}\mathbb{E}_{z\sim q(z)}\big[\log q_\phi(y=1|z)\big]. \nonumber
\end{align}
That is, the discriminator with parameters $\phi$ is trained to maximize the accuracy of distinguishing between hidden codes sampled from the prior $q(z)$ and codes inferred from observed data examples $x$. The objective $\mathcal{L}_{\theta,\eta}$ optimizes $\theta$ and $\eta$ to minimize the reconstruction loss of the observed data $x$ and, at the same time, to generate codes $z$ that fool the discriminator. We thus recover the conventional view of the AAE model. (A code sketch of these objectives is given at the end of this section.)

Predictability Minimization (PM) (Schmidhuber, 1992) is an early form of the adversarial approach, which aims at learning a code $z$ from data such that each unit of the code is hard to predict by an accompanying code predictor based on the remaining code units. AAE closely resembles PM if we see the discriminator as a special form of code predictor.

CycleGAN (Zhu et al., 2017) is a model that learns to translate examples of one domain (e.g., images of horses) to another domain (e.g., images of zebras), and vice versa, based on unpaired data. Let $x$ and $z$ be the variables of the two domains; then the objectives of AAE (Eq.31) are precisely the objectives that train the model to translate $x$ into $z$. The reverse translation is trained with the objectives of InfoGAN (Eq.9 in the paper), the symmetric counterpart of AAE.
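
To make the translation from the graphical model to Eqs.(31)-(32) concrete, here is a minimal PyTorch-style sketch of the AAE objectives. It is an illustration under assumed components (an encoder for $E_\eta$, a decoder for $p_\theta(x|z)$, and a code discriminator for $q_\phi(y|z)$), not the implementation of Makhzani et al. (2015).

```python
# Hypothetical sketch of the AAE objectives in Eqs.(31)-(32).
# Assumed modules: encoder E_eta (x -> z), decoder (z -> Bernoulli means in [0,1]),
# and code_disc q_phi (z -> probability that z came from the prior, i.e. y = 1).
import torch
import torch.nn.functional as F

def aae_losses(encoder, decoder, code_disc, x, prior_std=1.0):
    z_data = encoder(x)                              # z = E_eta(x), x ~ p_data   (y = 0)
    z_prior = prior_std * torch.randn_like(z_data)   # z ~ q(z)                   (y = 1)

    d_data = code_disc(z_data).clamp(1e-6, 1 - 1e-6)
    d_prior = code_disc(z_prior).clamp(1e-6, 1 - 1e-6)

    # Eq.(32): the discriminator separates prior codes (y=1) from inferred codes (y=0).
    loss_phi = -(torch.log(1 - d_data) + torch.log(d_prior)).mean()

    # L_{theta,eta}: reconstruct x and fool the discriminator (q_phi swapped for q_phi^r).
    # The y=1 branch is constant w.r.t. (theta, eta) and therefore omitted.
    recon = F.binary_cross_entropy(decoder(z_data), x)
    loss_theta_eta = recon - torch.log(d_data).mean()
    return loss_phi, loss_theta_eta
```

In the usual AAE training loop, the two losses are minimized alternately with separate optimizers, and z_data is detached when updating the discriminator.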

E PROOF OF LEMMA 2

Proof. For the reconstruction term:
\begin{align}
\mathbb{E}_{p_{\theta_0}(x)}\Big[\mathbb{E}_{q_\eta(z|x,y)\, q^r_*(y|x)}\big[\log p_\theta(x|z,y)\big]\Big]
&= \tfrac{1}{2}\mathbb{E}_{p_{\theta_0}(x|y=1)}\Big[\mathbb{E}_{q_\eta(z|x,y=0),\, y=0\sim q^r_*(y|x)}\big[\log p_\theta(x|z,y=0)\big]\Big] \nonumber\\
&\quad + \tfrac{1}{2}\mathbb{E}_{p_{\theta_0}(x|y=0)}\Big[\mathbb{E}_{q_\eta(z|x,y=1),\, y=1\sim q^r_*(y|x)}\big[\log p_\theta(x|z,y=1)\big]\Big] \tag{33}\\
&= \tfrac{1}{2}\mathbb{E}_{p_{\mathrm{data}}(x)}\Big[\mathbb{E}_{q_\eta(z|x)}\big[\log p_\theta(x|z)\big]\Big] + \mathrm{const}, \nonumber
\end{align}
where $y=0 \sim q^r_*(y|x)$ means $q^r_*(y|x)$ predicts $y=0$ with probability 1. Note that both $q_\eta(z|x,y=1)$ and $p_\theta(x|z,y=1)$ are constant distributions without free parameters to learn; $q_\eta(z|x,y=0) = q_\eta(z|x)$, and $p_\theta(x|z,y=0) = p_\theta(x|z)$.

For the KL prior regularization term:
\begin{align}
\mathbb{E}_{p_{\theta_0}(x)}\Big[\mathrm{KL}\big(q_\eta(z|x,y)\, q^r_*(y|x)\,\big\|\,p(z|y)p(y)\big)\Big]
&= \mathbb{E}_{p_{\theta_0}(x)}\Big[\textstyle\int q^r_*(y|x)\, \mathrm{KL}\big(q_\eta(z|x,y)\,\|\,p(z|y)\big)\, dy + \mathrm{KL}\big(q^r_*(y|x)\,\|\,p(y)\big)\Big] \nonumber\\
&= \tfrac{1}{2}\mathbb{E}_{p_{\theta_0}(x|y=1)}\Big[\mathrm{KL}\big(q_\eta(z|x,y=0)\,\|\,p(z|y=0)\big)\Big] + \tfrac{1}{2}\mathbb{E}_{p_{\theta_0}(x|y=0)}\big[\mathrm{const}\big] + \mathrm{const} \tag{34}\\
&= \tfrac{1}{2}\mathbb{E}_{p_{\mathrm{data}}(x)}\Big[\mathrm{KL}\big(q_\eta(z|x)\,\|\,p(z)\big)\Big] + \mathrm{const}. \nonumber
\end{align}
Combining Eq.(33) and Eq.(34), we recover the conventional VAE objective in Eq.(7) of the paper.

F VAE/GAN JOINT MODELS FOR MODE MISSING/COVERING

Previous works have explored combinations of VAEs and GANs. Such combinations can be naturally motivated by the asymmetric behaviors of the KL divergences that the two algorithms aim to optimize. Specifically, the VAE/GAN joint models (Larsen et al., 2015; Pu et al., 2017) that improve the sharpness of VAE-generated images can alternatively be motivated as remedying the mode-covering behavior of the KLD in VAEs. That is, the KLD tends to drive the generative model to cover all modes of the data distribution as well as regions with small values of $p_{\mathrm{data}}$, resulting in blurred, implausible samples. Incorporating GAN objectives alleviates the issue, as the inverted KL enforces the generator to focus on meaningful data modes. From the other perspective, augmenting GANs with VAE objectives helps address the mode-missing problem, which justifies the intuition of (Che et al., 2017a).

G IMPORTANCE WEIGHTED GANS (IWGAN)

From Eq.(6) in the paper, we can view GANs as maximizing a lower bound of the marginal log-likelihood on $y$:
\begin{align}
\log q(y) &= \log \int p_\theta(x|y)\, \frac{q^r(y|x)\, p_{\theta_0}(x)}{p_\theta(x|y)}\, dx
\;\ge\; \int p_\theta(x|y)\, \log \frac{q^r(y|x)\, p_{\theta_0}(x)}{p_\theta(x|y)}\, dx \tag{35}\\
&= -\mathrm{KL}\big(p_\theta(x|y)\,\|\,q^r(x|y)\big) + \mathrm{const}. \nonumber
\end{align}
We can apply the same importance weighting method as in IWAE (Burda et al., 2015) to derive a tighter bound:
\begin{align}
\log q(y) &= \log \mathbb{E}_{\prod_{i=1}^{k} p_\theta(x_i|y)}\left[\frac{1}{k}\sum_{i=1}^{k} \frac{q^r(y|x_i)\, p_{\theta_0}(x_i)}{p_\theta(x_i|y)}\right]
\;\ge\; \mathbb{E}_{\prod_{i=1}^{k} p_\theta(x_i|y)}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{q^r(y|x_i)\, p_{\theta_0}(x_i)}{p_\theta(x_i|y)}\right] \tag{36}\\
&= \mathbb{E}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] := \mathcal{L}_k(y), \nonumber
\end{align}
where we have denoted $w_i = \frac{q^r(y|x_i)\, p_{\theta_0}(x_i)}{p_\theta(x_i|y)}$, the unnormalized importance weight. We recover the lower bound of Eq.(35) when setting $k=1$.

To maximize the importance weighted lower bound $\mathcal{L}_k(y)$, we take the derivative w.r.t. $\theta$ and apply the reparameterization trick on the samples $x$:
\begin{align}
\nabla_\theta \mathcal{L}_k(y) &= \nabla_\theta\, \mathbb{E}_{x_1,\dots,x_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right]
= \mathbb{E}_{z_1,\dots,z_k}\left[\nabla_\theta \log \frac{1}{k}\sum_{i=1}^{k} w\big(y, x(z_i,\theta)\big)\right] \tag{37}\\
&= \mathbb{E}_{z_1,\dots,z_k}\left[\sum_{i=1}^{k} \widetilde{w}_i\, \nabla_\theta \log w\big(y, x(z_i,\theta)\big)\right], \nonumber
\end{align}
where $\widetilde{w}_i = w_i / \sum_{i'=1}^{k} w_{i'}$ are the normalized importance weights.
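
As a purely illustrative sketch of Eqs.(36)-(37): given the log-weights $\log w_i$, the bound $\mathcal{L}_k(y)$ is a log-mean-exp, and differentiating it automatically produces the self-normalized weighting $\widetilde{w}_i$. The snippet below demonstrates this with autograd on dummy log-weights; computing the log-weights themselves (e.g., via the discriminator-based approximation derived next) is assumed to happen elsewhere.

```python
# Illustration of Eq.(36)-(37): L_k is a log-mean-exp of the importance weights,
# and its gradient w.r.t. the log-weights equals the normalized weights w_tilde.
import torch

k = 8
log_w = torch.randn(k, requires_grad=True)   # stand-in for log w_i

# L_k(y) = log (1/k) sum_i w_i, computed stably with logsumexp.
L_k = torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(k)))
L_k.backward()

w_tilde = torch.softmax(log_w.detach(), dim=0)   # normalized importance weights
print(torch.allclose(log_w.grad, w_tilde))       # True: dL_k / dlog w_i = w_tilde_i, as in Eq.(37)
```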

We expand the weight at $\theta = \theta_0$:
\begin{align}
w_i\big|_{\theta=\theta_0} = \frac{q^r(y|x_i)\, p_{\theta_0}(x_i)}{p_\theta(x_i|y)}\bigg|_{\theta=\theta_0}
= q^r(y|x_i)\, \frac{\tfrac{1}{2}\, p_{\theta_0}(x_i|y=0) + \tfrac{1}{2}\, p_{\theta_0}(x_i|y=1)}{p_{\theta_0}(x_i|y)}. \tag{38}
\end{align}
The ratio of $p_{\theta_0}(x_i|y=0)$ and $p_{\theta_0}(x_i|y=1)$ is intractable. Using Bayes' rule and approximating with the discriminator distribution, we have
\begin{align}
\frac{p(x|y=0)}{p(x|y=1)} = \frac{p(y=0|x)\, p(y=1)}{p(y=1|x)\, p(y=0)} \approx \frac{q(y=0|x)}{q(y=1|x)}. \tag{39}
\end{align}
Plugging Eq.(39) into the above, we have
\begin{align}
w_i\big|_{\theta=\theta_0} \approx \frac{q^r(y|x_i)}{q(y|x_i)}. \tag{40}
\end{align}
In Eq.(37), the derivative $\nabla_\theta \log w_i$ is
\begin{align}
\nabla_\theta \log w\big(y, x(z_i,\theta)\big) = \nabla_\theta \log q^r\big(y|x(z_i,\theta)\big) + \nabla_\theta \log \frac{p_{\theta_0}(x_i)}{p_\theta(x_i|y)}. \tag{41}
\end{align}
The second term on the RHS of the equation is intractable, as it involves evaluating the likelihood of implicit distributions. However, if we take $k=1$, it can be shown that
\begin{align}
\mathbb{E}_{p(y)p(z|y)}\left[\nabla_\theta \log \frac{p_{\theta_0}\big(x(z,\theta)\big)}{p_\theta\big(x(z,\theta)|y\big)}\right]\Bigg|_{\theta=\theta_0}
&= \nabla_\theta\left(\tfrac{1}{2}\mathbb{E}_{p_\theta(x|y=0)}\left[\log \frac{p_{\theta_0}(x)}{p_\theta(x|y=0)}\right] + \tfrac{1}{2}\mathbb{E}_{p_\theta(x|y=1)}\left[\log \frac{p_{\theta_0}(x)}{p_\theta(x|y=1)}\right]\right)\Bigg|_{\theta=\theta_0} \tag{42}\\
&= -\nabla_\theta\, \mathrm{JSD}\big(p_{g_\theta}(x)\,\|\,p_{\mathrm{data}}(x)\big)\Big|_{\theta=\theta_0}, \nonumber
\end{align}
where the last equality is based on Eq.(23) and Eq.(25). That is, when $k=1$, the second term on the RHS of Eq.(41) is precisely the JSD gradient term that is subtracted away in standard GANs, as shown in Eq.(6) in the paper. We thus follow standard GANs and also remove the second term even when $k>1$. Therefore, the resulting update rule for the generator parameter $\theta$ is
\begin{align}
\nabla_\theta \mathcal{L}_k(y) = \mathbb{E}_{z_1,\dots,z_k \sim p(z|y)}\left[\sum_{i=1}^{k} \widetilde{w}_i\, \nabla_\theta \log q^r_{\phi_0}\big(y|x(z_i,\theta)\big)\right]. \tag{43}
\end{align}

H ADVERSARY ACTIVATED VAES (AAVAE)

In our formulation, VAEs include a degenerated adversarial discriminator which blocks out generated samples from contributing to model learning. We enable adaptive incorporation of fake samples by activating the adversarial mechanism. Again, the derivations are straightforward by making a symbolic analog to GANs.

We replace the perfect discriminator $q_*(y|x)$ in vanilla VAEs with the discriminator network $q_\phi(y|x)$ parameterized with $\phi$, as in GANs, resulting in an adapted objective of Eq.(12) in the paper:
\begin{align}
\max_{\theta,\eta} \mathcal{L}^{\mathrm{aavae}}_{\theta,\eta} = \mathbb{E}_{p_{\theta_0}(x)}\Big[\mathbb{E}_{q_\eta(z|x,y)\, q^r_\phi(y|x)}\big[\log p_\theta(x|z,y)\big] - \mathrm{KL}\big(q_\eta(z|x,y)\, q^r_\phi(y|x)\,\big\|\,p(z|y)p(y)\big)\Big]. \tag{44}
\end{align}
The form of Eq.(44) is precisely symmetric to the objective of InfoGAN in Eq.(9), with the additional KL prior regularization.

Before analyzing the effect of adding the learnable discriminator, we first look at how the discriminator is learned. In analogy to GANs in Eq.(3) and InfoGAN in Eq.(9), the objective of optimizing $\phi$ is obtained by simply replacing the inverted distribution $q^r_\phi(y|x)$ with $q_\phi(y|x)$:
\begin{align}
\max_{\phi} \mathcal{L}^{\mathrm{aavae}}_{\phi} = \mathbb{E}_{p_{\theta_0}(x)}\Big[\mathbb{E}_{q_\eta(z|x,y)\, q_\phi(y|x)}\big[\log p_\theta(x|z,y)\big] - \mathrm{KL}\big(q_\eta(z|x,y)\, q_\phi(y|x)\,\big\|\,p(z|y)p(y)\big)\Big]. \tag{45}
\end{align}
Intuitively, the discriminator is trained to distinguish between real and fake instances by predicting an appropriate $y$ that selects the components of $q_\eta(z|x,y)$ and $p_\theta(x|z,y)$ to best reconstruct $x$. The difficulty of Eq.(45) is that $p_\theta(x|z,y=1) = p_{\mathrm{data}}(x)$ is an implicit distribution which is intractable for likelihood evaluation. We thus use the alternative objective, as in GANs, to train a binary classifier:
\begin{align}
\max_{\phi} \mathcal{L}^{\mathrm{aavae}}_{\phi} = \mathbb{E}_{p_\theta(x|z,y)\, p(z|y)\, p(y)}\big[\log q_\phi(y|x)\big]. \tag{46}
\end{align}
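
Eq.(46) is a standard binary-classification objective, so the discriminator update can be sketched in a few lines. The snippet below is an illustrative sketch under assumed components (a decoder defining $p_\theta(x|z,y=0)$ and a discriminator network for $q_\phi(y|x)$); it is not the authors' implementation.

```python
# Hypothetical sketch of the AAVAE discriminator objective in Eq.(46):
# train q_phi(y|x) to classify real data (y=1) against samples decoded from the prior (y=0).
import torch
import torch.nn.functional as F

def aavae_discriminator_loss(decoder, discriminator, x_real, latent_dim):
    z = torch.randn(x_real.size(0), latent_dim)    # z ~ p(z|y=0)
    x_fake = decoder(z).detach()                   # x ~ p_theta(x|z, y=0); no gradient to theta here

    logits_real = discriminator(x_real)            # unnormalized log q_phi(y=1|x)
    logits_fake = discriminator(x_fake)

    # Minimizing this cross-entropy maximizes E[log q_phi(y|x)] in Eq.(46).
    loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) \
         + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss
```

The learned $q_\phi(y|x)$ then replaces the perfect discriminator in Eq.(44), which is what allows fake samples to be adaptively incorporated into learning.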

I EXPERIMENTS

I.1 IMPORTANCE WEIGHTED GANS

We extend both vanilla GANs and class-conditional GANs (CGAN) with the importance weighting method. The base GAN model is implemented with the DCGAN architecture and hyperparameter setting (Radford et al., 2015). We do not tune the hyperparameters for the importance weighted extensions. We use MNIST, SVHN, and CIFAR10 for evaluation. For vanilla GANs and the IW extension, we measure inception scores (Salimans et al., 2016) on the generated samples. We train deep residual networks provided in the TensorFlow library as evaluation networks, which achieve inception scores of 9.09, 6.55, and 8.77 on the test sets of MNIST, SVHN, and CIFAR10, respectively.

For conditional GANs, we evaluate the accuracy of conditional generation (Hu et al., 2017). That is, we generate samples given class labels, and then use a pre-trained classifier to predict class labels of the generated samples. The accuracy is calculated as the percentage of predictions that match the conditioning labels. The evaluation networks achieve accuracies of 0.990 and 0.902 on the test sets of MNIST and SVHN, respectively. (A sketch of this evaluation procedure is given at the end of this section.)

I.2 ADVERSARY ACTIVATED VAES

We apply the adversary activating method to vanilla VAEs, class-conditional VAEs (CVAE), and semi-supervised VAEs (SVAE) (Kingma et al., 2014). We evaluate on the MNIST data. The generator networks have the same architecture as the generators of the GANs in the above experiments, with sigmoid activation functions on the last layer to compute the means of the Bernoulli distributions over pixels. The inference networks, discriminators, and the classifier in SVAE share the same architecture as the discriminators in the GAN experiments.

We evaluate the lower bound value on the test set, with a varying number of real training examples. For each minibatch of real examples we generate an equal number of fake samples for training. In the experiments we found it generally helpful to smooth the discriminator distributions by setting the temperature of the output sigmoid function larger than 1. This basically encourages the use of fake data for learning. We select the best temperature from {1, 1.5, 3, 5} through cross-validation. We do not tune other hyperparameters for the adversary activated extensions.

Table 4 reports the full results of SVAE and AA-SVAE, with the average classification accuracy and standard deviations over 5 runs.

                 1%                10%
SVAE             0.9412 ± .0039    0.9768 ± .0009
AA-SVAE          0.9425 ± .0045    0.9797 ± .0010

Table 4: Classification accuracy of semi-supervised VAEs and the adversary activated extension on the MNIST test set, with varying sizes of the real labeled training set.
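
For completeness, the conditional-generation accuracy metric described in Sec. I.1 can be sketched as follows. This is an illustrative sketch only: the conditional_generator and pretrained_classifier callables, the number of classes, the latent dimension, and the sample count are assumptions, not the exact evaluation code behind the reported numbers.

```python
# Hypothetical sketch of the conditional-generation accuracy metric (Sec. I.1):
# generate samples for given class labels, classify them with a pre-trained
# classifier, and report the fraction of predictions matching the conditioning labels.
import torch

@torch.no_grad()
def conditional_accuracy(conditional_generator, pretrained_classifier,
                         num_classes=10, samples_per_class=100, latent_dim=100):
    correct, total = 0, 0
    for label in range(num_classes):
        y = torch.full((samples_per_class,), label, dtype=torch.long)
        z = torch.randn(samples_per_class, latent_dim)
        x_fake = conditional_generator(z, y)                  # samples conditioned on label y
        pred = pretrained_classifier(x_fake).argmax(dim=1)    # predicted class labels
        correct += (pred == y).sum().item()
        total += samples_per_class
    return correct / total
```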