arxiv: v4 [stat.ml] 16 Sep 2018


Relaxed Wasserstein with Applications to GANs

Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang

September 9, 2018

Abstract

We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence generalizes Wasserstein divergence and is parametrized by a class of strictly convex and differentiable functions. We establish for RW divergence several probabilistic properties, which are critical for the success of Wasserstein divergence. In particular, we show that RW divergence is dominated by Total Variation (TV) and Wasserstein-$L_2$ divergence, and that RW divergence has continuity, differentiability and a duality representation. Finally, we provide a non-asymptotic moment estimate and a concentration inequality for RW divergence. Our experiments on image generation demonstrate that RW divergence is a suitable choice for GANs. The performance of RWGANs with Kullback-Leibler (KL) divergence is competitive with other state-of-the-art GAN approaches. Moreover, RWGANs possess better convergence properties than the existing WGANs, with competitive inception scores. To the best of our knowledge, this new conceptual framework is the first to provide not only flexibility in designing effective GAN schemes, but also the possibility of studying different loss functions under a unified mathematical framework.

1 Introduction

GANs. Generative Adversarial Networks (GANs) [16] provide a versatile class of models for generative modeling. Since their introduction to the machine learning community, the popularity of GANs has grown exponentially, with numerous applications. Examples include high-resolution image generation [12, 31], image inpainting [37], image super-resolution [21], visual manipulation [39], text-to-image synthesis [32], video generation [36], semantic segmentation [23], and abstract reasoning diagram generation [15]. See also [1, 22, 25] for more details on the training dynamics of GANs.
Author affiliations (in author order):
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. xinguo@berkeley.edu.
Department of Statistics, University of California, Berkeley, USA. jcyhong@berkeley.edu.
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. darren_lin@berkeley.edu.
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. nanyang@berkeley.edu.

The key idea behind GANs is to interpret the process of generative modeling as a competing game between two networks: a generator network and a discriminator network. The generator network attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is a fake data sample or a true data sample.

There are many variants of GANs. Least square GANs (LSGANs) [24] attain a stable performance during the learning process by replacing the sigmoid cross entropy loss with the least square loss in the discriminator network of the original GANs. They are also shown to generate higher quality images than the original GANs in practice. DRAGANs [20] alleviate the instability of GAN training and offer a clear game-theoretic justification by introducing regret minimization to reach the equilibrium in games and to further explain the success of simultaneous gradient descent in GANs. Conditional GANs (CGANs) [27] are the conditional version of GANs; they stabilize the training by imposing control on the modes of the generated data in the original generative model. Information-theoretic GANs (InfoGANs) [10] are an information-theoretic extension of GANs that provides highly semantic and meaningful hidden representations on a number of image datasets by maximizing the mutual information between a fixed small subset of the GAN's noise variables and the observations. Auxiliary Classifier GANs (ACGANs) [29] improve GANs by adding more structure to the latent space together with a specialized cost function, yielding high-quality samples. They also lead to a new analysis for assessing the discriminability and diversity of samples from class-conditional image synthesis models. Energy-Based GANs (EBGANs) [38] propose a new energy perspective of GANs. They construct an energy function for the discriminator that attributes lower energies to the regions near the data manifold and higher energies to other regions. As a result, the EBGAN framework is shown to generate reasonable high-resolution images without a multi-scale approach. Boundary Equilibrium GANs (BEGANs) [5] adopt a new equilibrium enforcing method paired with the Wasserstein divergence to train GANs with an auto-encoder. This approach not only balances the generator network and the discriminator network but also uncovers a novel approximate convergence measure, leading to fast and stable training with high visual quality.

WGANs. A recurring theme in improving GAN training is the choice of loss functions. The first proposed class of loss functions is based on the Jensen-Shannon (JS) divergence, which is essentially the symmetric version of the Kullback-Leibler (KL) divergence. It is shown in [2] that JS divergence is undesirable because it leads to unstable training, and Wasserstein-$L_1$ divergence is suggested as an alternative. The resulting Wasserstein GANs (WGANs) outperform the original GANs in several aspects. The Wasserstein-$L_1$ divergence is continuous, differentiable and has a duality representation, allowing a very stable gradient flow in the process of training. Besides the stability, the Wasserstein-$L_1$ divergence also avoids the issue of mode collapse and further provides meaningful learning curves that can be used for debugging and for hyperparameter searching. With additional weight clipping [2] and gradient penalty [17], the volatility of the gradient is controlled to some extent.

Our work. We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence is Wasserstein divergence parametrized by a class of strictly convex and differentiable functions which contain different curvature information. Naturally, RW divergence provides more flexibility and possibilities in generative modeling. To ensure that RW divergence is a viable option for comparing probability distributions and is competitive with other Wasserstein divergences in generative modeling, this paper addresses the following theoretical questions along with related computational issues. Does the gradient of RW divergence exist and admit an explicit form? Does RW divergence enjoy the same mathematical properties as the standard Wasserstein divergence? Does RW divergence have a duality representation, as Wasserstein divergence does?

In this paper, we first show that RW divergence is dominated by the total variation (TV) distance and the squared Wasserstein-$L_2$ divergence (Theorem 3.1). We then obtain its nonasymptotic moment estimate (Theorem 3.2), its concentration inequality (Theorem 3.3), and its duality representation (Theorem 3.6). For application purposes, we show the existence of the gradient of RW divergence by first establishing its continuity and differentiability (Theorem 3.5). These properties justify the gradient descent procedure, with an explicit formula for the gradient evaluation and an asymmetric clipping (Corollary 3.6.1). This asymmetric clipping is useful for controlling the volatility of the gradient.

Finally, we compare RWGANs with several state-of-the-art GANs in image generation. We use RWGANs with KL divergence and the architectures of DCGAN and MLP. We first evaluate all candidate methods on the MNIST and Fashion-MNIST datasets and show that RWGANs are competitive with other popular approaches. Then we conduct experiments on the CIFAR-10 and ImageNet datasets to investigate whether RWGANs outperform WGANs with symmetric clipping and with gradient penalty, denoted as WGANs and WGANs-GP respectively. Our numerical results suggest that RWGANs strike a balance between WGANs and WGANs-GP: WGANs-GP fail to converge although they can achieve the fastest rate of training, whereas RWGANs are very robust and converge faster than WGANs. Therefore, RWGANs are more desirable for large-scale computations. Furthermore, RWGANs attain the highest inception scores at the initial stage of training on CIFAR-10, meaning that the generated samples correlate well with human evaluations [33]. As a byproduct, our experiment provides some evidence that an appropriate weight clipping has the potential to be competitive with gradient penalty in WGANs.

Open question. This new conceptual framework provides a unified mathematical framework in which to implement and investigate different Wasserstein divergences. Such flexibility raises a natural question: can the underlying convex function $\phi$ be determined in advance based on the data samples and the problem structure? While the main focus of this paper is the application to GANs, we believe that the theoretical results on RW divergence can be a valuable addition to the rich theory of optimal transport, where regularities of Wasserstein-based cost functions have been extensively studied [7, 8, 9, 35].

Organization. The rest of the paper is organized as follows. Section 2 provides the preliminaries and notations that will be used throughout the paper. Section 3 describes the RW divergence and discusses its theoretical properties. In Sections 4.1 and 4.2, we discuss the implementation of the method and present two numerical studies on real data examples. Section 5 concludes our paper.

2 Background

In this section, we review the definitions and properties of Bregman divergence and Wasserstein divergence.

2.1 Notations

Throughout the paper, the following notations are used unless otherwise stated. We denote $x \in \mathbb{R}^d$ as a vector in Euclidean space and $X$ as a matrix. $x^\top$ denotes the transpose of a vector $x$ and $\log(x)$ denotes the component-wise logarithm of a vector $x$. $X \succeq 0$ or $X \succ 0$ means that $X$ is positive semi-definite or positive definite, respectively. $\mathcal{X} \subseteq \mathbb{R}^d$ denotes a set, where the diameter of $\mathcal{X}$ is defined as

\[ \mathrm{diam}(\mathcal{X}) = \max_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|_2, \]

and $1_{\mathcal{X}}$ denotes the indicator function of the set $\mathcal{X}$. We denote $P$ and $Q$ as two probability distributions, $\mathcal{P}(\mathcal{X})$ denotes the set of probability distributions defined on $\mathcal{X}$, and $\Pi(P, Q)$ denotes the set of all couplings of $P$ and $Q$, i.e., the set of all joint distributions over $\mathcal{X} \times \mathcal{X}$ with marginal distributions $P$ and $Q$. We assume that $\phi$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, i.e., $0 \prec \nabla^2 \phi(x) \preceq L I_d$, where $x \in \mathrm{dom}(\phi)$, the domain of $\phi$, and $I_d$ is the identity matrix in $\mathbb{R}^{d \times d}$. For the statistical learning setup, we define $P_r$ as an unknown true probability distribution, $P_n$ as the empirical distribution based on $n$ observations from $P_r$, and $\{P_\theta : \theta \in \mathbb{R}^d\}$ as a parametric family of probability distributions.

2.2 Wasserstein Divergence

Definition 2.1. The Wasserstein divergence of order $p$ between the probability distributions $P$ and $Q$ is defined as

\[ W_p(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} [c(x, y)]^p \, \pi(dx, dy) \right)^{1/p}, \tag{1} \]

where $p \geq 1$ and $c \geq 0$ is a metric supported on $\mathcal{X} \times \mathcal{X}$.

An important special case is the Wasserstein-$L_q$ divergence of order $p$:

\[ W_p^{L_q}(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_q^p \, \pi(dx, dy) \right)^{1/p}. \tag{2} \]

Letting $p = q = 2$ and $\mathcal{X} = \mathbb{R}^d$ in (2), we obtain the squared Wasserstein-$L_2$ divergence of order 2:

\[ W_2^{L_2}(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int \|x\|_2^2 \, (P + Q)(dx) - \sup_{\pi \in \Pi(P,Q)} \int 2 x^\top y \, \pi(dx, dy) \right)^{1/2}. \]

Remark 1. Given $P$ and $Q$, we have the following two properties of the Wasserstein divergence of order $p$:
1. $W_p(P, Q) \geq 0$, and the equality holds if and only if $P = Q$ almost everywhere.
2. $W_p(P, Q)$ is a metric, since $W_p(P, Q) = W_p(Q, P)$ and $W_p(P, Q) \leq W_p(P, S) + W_p(S, Q)$, where $S$ is another probability distribution.

2.3 Bregman Divergence

Definition 2.2 ([18]). The Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$ is denoted as

\[ D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \tag{3} \]

for any $x, y \in \mathbb{R}^d$, and

\[ D_\phi(P, Q) = \int \left[ \phi(p(x)) - \phi(q(x)) - \langle \nabla \phi(q(x)), p(x) - q(x) \rangle \right] dx \]

for two continuous probability distributions $P$ and $Q$, where $p = dP/d\mu$ and $q = dQ/d\mu$, given that $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure $\mu$.

Examples of the function $\phi$ and the resulting Bregman divergences are listed as follows:
- $L_2$ divergence: $D_\phi(x, y) = \|x - y\|_2^2$, where $\phi(x) = \|x\|_2^2$.
- Itakura-Saito divergence: $D_\phi(x, y) = \frac{x}{y} - \log\left(\frac{x}{y}\right) - 1$, where $\phi(x) = -\log x$.
- KL divergence: $D_\phi(x, y) = x^\top \log\left(\frac{x}{y}\right)$, where $\phi(x) = x^\top \log(x)$ (for $x$ and $y$ with equal total mass, e.g., on the probability simplex).
- Mahalanobis divergence: $D_\phi(x, y) = (x - y)^\top A (x - y)$, where $\phi(x) = x^\top A x$ and $A \succ 0$.

Remark 2.
1. $D_\phi(x, y) \geq 0$ due to the convexity of $\phi$, and the equality holds if and only if $x = y$.
2. $D_\phi(x, y)$ is not a metric: in general it is not symmetric and it violates the triangle inequality.
3. Bregman divergences are asymptotically equivalent to $f$-divergences (in particular, the $\chi^2$-divergence) under some conditions [30], and are the unique class of divergences for which the conditional expectation is the optimal predictor [3].
4. In statistical learning, the Bregman divergence is extensively exploited for K-means clustering [4].

We also provide a lemma which will be used in our analysis.

Lemma 2.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then

\[ D_\phi(x, y) \leq \frac{L}{2} \|x - y\|_2^2 \]

for any $x, y \in \mathbb{R}^d$.
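As a small numerical companion to Definition 2.2, the following sketch evaluates $D_\phi(x, y)$ directly from equation (3) for three of the generators listed above. It is only an illustration: the function and variable names (`bregman`, `sq_norm`, `neg_entropy`, `neg_log`) are our own, and vectors are represented as plain Python lists.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>, as in equation (3)."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

# phi(x) = ||x||_2^2 gives the squared Euclidean distance.
sq_norm = lambda x: sum(xi * xi for xi in x)
sq_norm_grad = lambda x: [2.0 * xi for xi in x]

# phi(x) = sum_i x_i log x_i gives the KL divergence when x and y sum to the same mass.
neg_entropy = lambda x: sum(xi * math.log(xi) for xi in x)
neg_entropy_grad = lambda x: [math.log(xi) + 1.0 for xi in x]

# phi(x) = -sum_i log x_i gives the (component-wise summed) Itakura-Saito divergence.
neg_log = lambda x: -sum(math.log(xi) for xi in x)
neg_log_grad = lambda x: [-1.0 / xi for xi in x]

# Two points on the probability simplex (illustrative values).
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_l2 = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
d_is = bregman(neg_log, neg_log_grad, x, y)
d_kl_reversed = bregman(neg_entropy, neg_entropy_grad, y, x)
```

For $\phi(x) = \|x\|_2^2$ the Hessian is $2 I_d$, so $L = 2$ and the bound of Lemma 2.1 holds with equality; comparing `d_kl` with `d_kl_reversed` shows the asymmetry noted in Remark 2.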

Proof.

\[
\begin{aligned}
D_\phi(x, y) &= \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla \phi(tx + (1-t)y), x - y \rangle \, dt - \langle \nabla \phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla \phi(tx + (1-t)y) - \nabla \phi(y), x - y \rangle \, dt \\
&\leq \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 \\
&= \frac{L}{2} \|x - y\|_2^2,
\end{aligned}
\]

where the second equality comes from the mean value theorem (in integral form) and the inequality comes from the fact that $\phi$ is a twice-differentiable function with an $L$-Lipschitz continuous gradient.

3 Relaxed Wasserstein Divergence

We now propose a new class of statistical divergences called Relaxed Wasserstein (RW) divergence, which can be seen as a combination of Bregman divergence and Wasserstein divergence. The term "relaxed" refers to the fact that RW divergence relaxes the symmetry of the cost function $c(x, y)$ in (1) and extends to a broader class of asymmetric divergences.

Definition 3.1. The Relaxed Wasserstein divergence between the probability distributions $P$ and $Q$ is defined as

\[ W_{D_\phi}(P, Q) = \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} D_\phi(x, y) \, \pi(dx, dy), \]

where $D_\phi$ is the Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$.

Remark 3.
1. $W_{D_\phi}(P, Q) \geq 0$, and the equality holds if and only if $P = Q$ almost everywhere.
2. $W_{D_\phi}(P, Q)$ is not a metric since $D_\phi(x, y)$ is asymmetric.
3. $W_{D_\phi}(P, Q)$ includes two important special cases, $W_2^{L_2}$ and $W_{KL}$. More specifically, $W_{D_\phi} = [W_2^{L_2}]^2$ when $\phi(x) = \|x\|_2^2$, and $W_{D_\phi} = W_{KL}$ when $\phi(x) = x^\top \log(x)$.

3.1 Probabilistic Properties

In this subsection, we establish several probabilistic properties of RW divergence. Recall that the Wasserstein divergence is controlled by the weighted Total Variation (TV) distance (see Theorem 6.15 in [35] for more details). In parallel, we show that the RW divergence is dominated by the weighted TV distance and the squared Wasserstein-$L_2$ divergence.

Definition 3.2. The Total Variation distance between the probability distributions $P$ and $Q$ is defined as

\[ TV(P, Q) := \sup_{A} |P(A) - Q(A)|, \tag{4} \]

where $A$ is a Borel set.
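For distributions on a finite sample space, the supremum in Definition 3.2 can be enumerated explicitly. The sketch below does so and compares the result with the familiar half-$L_1$ formula, which is a standard identity for discrete distributions rather than something stated in this paper; the helper names `tv_by_definition` and `tv_half_l1` are illustrative.

```python
from itertools import combinations

def tv_by_definition(p, q):
    """Supremum over all events A of |P(A) - Q(A)| on a finite sample space."""
    n = len(p)
    best = 0.0
    for k in range(n + 1):
        for event in combinations(range(n), k):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best

def tv_half_l1(p, q):
    """Closed form for discrete distributions: half of the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Two probability vectors on three outcomes (illustrative values).
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

tv_enumerated = tv_by_definition(p, q)
tv_closed_form = tv_half_l1(p, q)
```

The enumeration is exponential in the number of outcomes, so it is only meant as a check of the definition on tiny examples.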

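Theorem 3.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then

\[ W_{D_\phi}(P, Q) \leq L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q), \qquad W_{D_\phi}(P, Q) \leq \frac{L}{2} \left[ W_2^{L_2}(P, Q) \right]^2, \]

where $P$ and $Q$ are two probability distributions supported on a compact set $\mathcal{X} \subseteq \mathbb{R}^d$.

Proof. For the first inequality, we define $\pi^*$ as the transfer plan that keeps all the mass shared by $P$ and $Q$ fixed and distributes the rest uniformly, i.e.,

\[ \pi^*(dx, dy) = (P \wedge Q)(dx) \, \delta_{\{y = x\}} + \frac{1}{a} (P - Q)_+(dx) \, (P - Q)_-(dy), \]

where $P \wedge Q = P - (P - Q)_+$ and $a = (P - Q)_+[\mathcal{X}] = (P - Q)_-[\mathcal{X}]$. Then, for a fixed reference point $x_0 \in \mathcal{X}$,

\[
\begin{aligned}
W_{D_\phi}(P, Q) &\leq \int D_\phi(x, y) \, \pi^*(dx, dy) \\
&= \frac{1}{a} \iint \left[ \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&= \frac{1}{a} \iint \left[ \int_0^1 \langle \nabla \phi(tx + (1-t)y) - \nabla \phi(y), x - y \rangle \, dt \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\leq \frac{1}{a} \iint \left[ \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\leq \frac{L}{2a} \iint \|x - y\|_2^2 \, (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\leq \frac{L}{a} \iint \left[ \|x - x_0\|_2^2 + \|x_0 - y\|_2^2 \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\leq L \left[ \int \|x - x_0\|_2^2 \, (P - Q)_+(dx) + \int \|y - x_0\|_2^2 \, (P - Q)_-(dy) \right] \\
&= L \int \|x - x_0\|_2^2 \, |P - Q|(dx) \\
&= L \, [\mathrm{diam}(\mathcal{X})]^2 \, |P(\mathcal{X}) - Q(\mathcal{X})| \\
&\leq L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q),
\end{aligned}
\]

where the first inequality comes from Definition 3.1, the first equality comes from Definition 2.2 and the definition of the specific $\pi^*$, the second inequality is by Lemma 2.1, the fourth inequality comes from the triangle inequality, and the last inequality is from Definition 3.2.

For the second inequality, we have

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P,Q)} \int D_\phi(x, y) \, \pi(dx, dy) \\
&\leq 0.5 L \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \\
&= 0.5 L \left[ W_2^{L_2}(P, Q) \right]^2,
\end{aligned}
\]

where the inequality holds thanks to Lemma 2.1 and the fact that $\pi(dx, dy) \geq 0$ for any coupling $\pi \in \Pi(P, Q)$. This completes the proof.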
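The second bound of Theorem 3.1 can be checked numerically on a toy problem. The sketch below is an illustration under our own assumptions, not the paper's experimental code: it takes two empirical distributions with the same number of equally weighted atoms, for which an optimal coupling can be taken to be a permutation (a standard fact about the assignment problem), and finds it by brute force. The generator $\phi(x) = \sum_i a_i x_i^2$ and all names are illustrative choices.

```python
from itertools import permutations

def bregman_diag_quadratic(a, x, y):
    """D_phi(x, y) for phi(x) = sum_i a_i x_i^2, which equals sum_i a_i (x_i - y_i)^2."""
    return sum(ai * (xi - yi) ** 2 for ai, xi, yi in zip(a, x, y))

def sq_euclid(x, y):
    """Squared Euclidean distance, the cost behind the Wasserstein-L2 divergence."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def min_assignment_cost(cost, xs, ys):
    """Optimal transport cost between two uniform empirical measures by brute force."""
    n = len(xs)
    return min(
        sum(cost(xs[i], ys[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )

a = [0.5, 2.0]        # phi(x) = 0.5 * x_1^2 + 2 * x_2^2, so the Hessian is diag(1, 4)
L = 2.0 * max(a)      # Lipschitz constant of the gradient, here L = 4

# Atoms of the two empirical distributions (illustrative values).
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [(0.5, 0.2), (0.9, 0.8), (0.1, 0.7), (0.4, 0.4)]

rw = min_assignment_cost(lambda x, y: bregman_diag_quadratic(a, x, y), xs, ys)
w2_squared = min_assignment_cost(sq_euclid, xs, ys)
bound = 0.5 * L * w2_squared
```

Here `rw` plays the role of $W_{D_\phi}(P, Q)$ and `w2_squared` that of $[W_2^{L_2}(P, Q)]^2$, so `rw <= bound` is the inequality being illustrated. Brute force over permutations is only feasible for a handful of atoms.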

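Next, we establish another key probabilistic property of RW divergence, namely the nonasymptotic moment estimate and the concentration inequality. Our results follow from two theorems presented in [14] and Theorem 3.1. We assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, and we further define two statistics,

\[ M_q(P_r) = \int \|x\|_2^q \, P_r(dx), \qquad E_{\alpha,\gamma}(P_r) = \int \exp\left( \gamma \|x\|_2^\alpha \right) P_r(dx). \]

Theorem 3.2 (Nonasymptotic Moment Estimate). Assume that $M_q(P_r) < +\infty$ for some $q > 2$. Then there exists a constant $C(q, d) > 0$ such that, for $n \geq 1$,

\[
\mathbb{E}\left[ W_{D_\phi}(P_n, P_r) \right] \leq \frac{C(q, d) \, L \, M_q^{2/q}(P_r)}{2}
\begin{cases}
n^{-1/2} + n^{-(q-2)/q}, & 1 \leq d \leq 3, \ q \neq 4, \\
n^{-1/2} \log(1 + n) + n^{-(q-2)/q}, & d = 4, \ q \neq 4, \\
n^{-2/d} + n^{-(q-2)/q}, & d \geq 5, \ q \neq d/(d-2).
\end{cases}
\]

Theorem 3.3 (Concentration Inequality). Assume that one of the following three conditions holds:

\[ \alpha > 2, \ \gamma > 0, \ E_{\alpha,\gamma}(P_r) < \infty, \tag{5} \]

or

\[ \alpha \in (0, 2), \ \gamma > 0, \ E_{\alpha,\gamma}(P_r) < \infty, \tag{6} \]

or

\[ q > 4, \ M_q(P_r) < \infty. \tag{7} \]

Then for $n \geq 1$ and $\epsilon > 0$,

\[ \mathrm{Prob}\left( W_{D_\phi}(P_n, P_r) \geq \epsilon \right) \leq a(n, \epsilon) \, 1_{\{\epsilon \leq L/2\}} + b(n, \epsilon), \]

where

\[
a(n, \epsilon) = C_1
\begin{cases}
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2} \right), & 1 \leq d \leq 3, \\[2mm]
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2 \log^2\left( 2 + \frac{L}{2\epsilon} \right)} \right), & d = 4, \\[2mm]
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{d/2} \right), & d \geq 5,
\end{cases}
\]

and

\[
b(n, \epsilon) = C_2
\begin{cases}
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, & \text{under condition (5)}, \\[2mm]
\exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{(\alpha - \epsilon')/2} \right) 1_{\{\epsilon \leq L/2\}} + \exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, \ 0 < \epsilon' < \alpha, & \text{under condition (6)}, \\[2mm]
n \left( \dfrac{2 n \epsilon}{L} \right)^{-(q - \epsilon')/2}, \ 0 < \epsilon' < q, & \text{under condition (7)},
\end{cases}
\]

where $c$, $C_1$ and $C_2$ are constants depending on $q$ and $d$, and $\epsilon'$ denotes an auxiliary parameter distinct from the threshold $\epsilon$.

3.2 Continuity, Differentiability and Duality Representation

In this subsection, we establish the continuity, differentiability and duality representation of RW divergence, demonstrating that RW divergence is a reasonable choice for GANs. We first present a simple yet important lemma.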

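Lemma 3.4 (Decomposition of RW divergence). The RW divergence can be decomposed in terms of the distorted squared Wasserstein-$L_2$ divergence of order 2 with several additional residual terms independent of the choice of coupling $\pi$, i.e.,

\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L_2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx). \]

The relationship is also presented in Figure 1.

[Figure 1: The decomposition of $W_{D_\phi}$, where the solid arrow denotes transformation and the dashed arrows denote the divergences between probability distributions.]

Proof. First, we need to prove that the inverse of $\nabla\phi$ is well-defined. Recall that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function, so $\nabla^2\phi(x) \succ 0$ for $x \in \mathcal{X}$. That is to say, the gradient mapping $\nabla\phi : \mathcal{X} \to \mathbb{R}^d$ has a positive-definite Jacobian matrix at each point. Applying the mean value theorem yields that $\nabla\phi$ is injective, so the inverse of $\nabla\phi$ exists and is bijective. Denote it as $(\nabla\phi)^{-1} : \nabla\phi(\mathcal{X}) \to \mathcal{X}$; then $Q \circ (\nabla\phi)^{-1}$ is also a probability distribution. Thus

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P,Q)} \int \left[ \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle \right] \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \Bigg\{ \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla\phi(y)\|_2^2 - \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] \pi(dx, dy) + \int \left[ \langle \nabla\phi(y), y \rangle - \phi(y) - \frac{1}{2} \|\nabla\phi(y)\|_2^2 \right] \pi(dx, dy) \Bigg\} \\
&= \inf_{\pi \in \Pi(P,Q)} \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla\phi(y)\|_2^2 - \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx).
\end{aligned}
\]

Furthermore, observe that

\[
\begin{aligned}
\left[ W_2^{L_2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 &= \inf_{\pi \in \Pi(P, Q \circ (\nabla\phi)^{-1})} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - \nabla\phi(y)\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \int \left[ \|x\|_2^2 + \|\nabla\phi(y)\|_2^2 - 2 \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy).
\end{aligned}
\]

Therefore,

\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L_2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx). \]

This completes the proof.

Now we are ready to present our main results on the continuity and differentiability of the parametrized RW divergence in generative modeling.

Definition 3.3 (Generative modeling). The procedure of generative modeling is to approximate the unknown probability distribution $P_r$ by constructing a class of suitable parametric probability distributions $P_\theta$. More specifically, we define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $P_Z$ and a sequence of parametric functions $g_\theta : \mathcal{Z} \to \mathcal{X}$. Then $P_\theta$ is defined as the probability distribution of $g_\theta(Z)$.

Theorem 3.5 (Continuity and Differentiability of RW divergence). We have the following two statements about RW divergence:
1. $W_{D_\phi}(P_r, P_\theta)$ is continuous in $\theta$ if $g_\theta$ is continuous in $\theta$.
2. $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}[L(\theta, Z)^2] < \infty$, i.e., for each given $(\theta_0, z_0)$, there exists a neighborhood $N$ such that

\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \leq L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]

for any $(\theta, z) \in N$.

Proof. It follows from Lemma 3.4 that $W_{D_\phi}(P_r, P_\theta) = T_1 + T_2$, where

\[ T_1 = \frac{1}{2} \left[ W_2^{L_2}\left( P_r, P_\theta \circ (\nabla\phi)^{-1} \right) \right]^2, \]

\[ T_2 = \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P_r(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] P_\theta(dx). \]

We observe that $T_2$ is continuous and differentiable with respect to $\theta$ since $\phi$ is a twice-differentiable function. Furthermore, since $(\nabla\phi)^{-1}$ is also continuous and differentiable, it suffices to show that $W_2^{L_2}(P_r, P_\theta)$ is continuous in $\theta$ if $g_\theta$ is continuous in $\theta$, and differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}[L(\theta, Z)^2] < \infty$ for any $\theta$.

Given two vectors $\theta_0, \theta \in \mathbb{R}^d$, we define $\pi$ as a joint distribution of $(g_\theta(Z), g_{\theta_0}(Z))$ where $Z \sim P_Z$. Then

\[ W_2^{L_2}(P_\theta, P_{\theta_0}) \leq \left( \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2}, \]

where $\|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \to 0$ for all $z \in \mathcal{Z}$ as $\theta \to \theta_0$, since $g_\theta$ is continuous in $\theta$. Furthermore, $\|g_{\theta_1}(z) - g_{\theta_2}(z)\|_2^2$ is uniformly bounded on $\mathcal{Z}$ since $g_\theta(z) \in \mathcal{X}$ and $\mathcal{X}$ is a compact set. Therefore, applying the bounded convergence theorem yields

\[ \left| W_2^{L_2}(P_r, P_\theta) - W_2^{L_2}(P_r, P_{\theta_0}) \right| \leq W_2^{L_2}(P_\theta, P_{\theta_0}) \leq \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2} \to 0, \quad \text{as } \theta \to \theta_0, \]

where the first inequality comes from the triangle inequality.

Given a pair $(\theta_0, z_0)$, the local Lipschitz continuity of $g_\theta$ implies that there exists a neighborhood $N$ such that

\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \leq L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]

for any $(\theta, z) \in N$. Then

\[ \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \leq \int_{\mathcal{Z}} [L(\theta_0, z_0)]^2 \, \|\theta - \theta_0\|_2^2 \, P_Z(dz_0) = \|\theta - \theta_0\|_2^2 \, \mathbb{E}\left[ L(\theta_0, Z)^2 \right]. \]

Therefore,

\[ \left| W_2^{L_2}(P_r, P_\theta) - W_2^{L_2}(P_r, P_{\theta_0}) \right| \leq W_2^{L_2}(P_\theta, P_{\theta_0}) \leq \left( \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \right)^{1/2} \leq \|\theta - \theta_0\|_2 \left( \mathbb{E}\left[ L(\theta_0, Z)^2 \right] \right)^{1/2}, \]

which implies that $W_2^{L_2}(P_r, P_\theta)$ is locally Lipschitz. Applying Rademacher's theorem [13] yields that $W_2^{L_2}(P_r, P_\theta)$ is differentiable with respect to $\theta$ almost everywhere. This completes the proof.

Next, we present our results on the duality representation of RW divergence.

Theorem 3.6 (Duality Representation of RW divergence). Assume that two probability distributions $P$ and $Q$ satisfy

\[ \int \|x\|_2^2 \, (P + Q)(dx) < +\infty. \]

Then there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as

\[ W_{D_\phi}(P, Q) = \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla\phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right), \]

where $f^*$ is the conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$.

Proof. Given two probability distributions $P$ and $Q$ that satisfy $\int \|x\|_2^2 \, (P + Q)(dx) < +\infty$, it follows from Proposition 3.1 in [6] that there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the squared Wasserstein-$L_2$ divergence of order 2 has a duality representation, i.e.,

\[
\begin{aligned}
\left[ W_2^{L_2}(P, Q) \right]^2 &= \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \int \|x\|_2^2 \, (P + Q)(dx) - 2 \left( \int f(x) \, P(dx) + \int f^*(x) \, Q(dx) \right),
\end{aligned}
\]

where $f^*$ is the convex conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$. Combining this with Lemma 3.4 yields

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \frac{1}{2} \left[ W_2^{L_2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx) \\
&= \frac{1}{2} \left( \int \|x\|_2^2 \, P(dx) + \int \|\nabla\phi(x)\|_2^2 \, Q(dx) \right) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx) \\
&= \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla\phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right).
\end{aligned}
\]

This completes the proof.

Finally, we show that Theorem 3.6 allows for an explicit formula for the gradient evaluation in generative modeling (Definition 3.3), providing the theoretical guarantee for RWGAN training.

Corollary 3.6.1 (Gradient Evaluation). Under the setting of generative modeling, we assume that $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}[L(\theta, Z)^2] < \infty$ and $\int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty$. Then there exists a Lipschitz continuous solution $f : \mathcal{X} \to \mathbb{R}$ such that the gradient of the RW divergence has an explicit form, i.e.,

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z \left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right]. \]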

Proof. Under the conditions that $g_\theta$ is locally Lipschitz and $\int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty$, it follows from Theorem 3.5 and Theorem 3.6 that $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere and there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as

\[ W_{D_\phi}(P_r, P_\theta) = \int \phi(x) \, (P_r - P_\theta)(dx) + \int \langle \nabla\phi(x), x \rangle \, P_\theta(dx) - \left( \int f(x) \, P_r(dx) + \int f^*(\nabla\phi(x)) \, P_\theta(dx) \right). \]

By the envelope theorem [26], we obtain that

\[
\begin{aligned}
\nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] &= \nabla_\theta \left[ -\int \phi(x) \, P_\theta(dx) + \int \langle \nabla\phi(x), x \rangle \, P_\theta(dx) - \int f^*(\nabla\phi(x)) \, P_\theta(dx) \right] \\
&= \nabla_\theta \left[ -\int_{\mathcal{Z}} \phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} \langle \nabla\phi(g_\theta(z)), g_\theta(z) \rangle \, P_Z(dz) - \int_{\mathcal{Z}} f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz) \right] \\
&= -\int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla\phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla\phi(g_\theta(z)) \, P_Z(dz) \\
&\qquad + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2\phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz) \\
&= \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2\phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz).
\end{aligned}
\]

Relabeling $-f^*$ as $f$, we conclude that

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z \left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right], \]

where $f$ is Lipschitz continuous. This completes the proof.

4 Empirical Results

In this section, we present a numerical evaluation on image generation to demonstrate the effectiveness and efficiency of using RW in GANs. We first describe our approach of RW in GANs (RWGANs) and the experimental setting (Section 4.1), and then report the experimental results for RWGANs and nine other well-established variants of GANs (Section 4.2).

4.1 Experimental Approach and Setting

RWGANs approach. The goal of GANs is to estimate a probability distribution $P_r$ hidden in the data. As in Definition 3.3, one can define a random variable $Z$ with a fixed distribution $P_Z$ and pass it through a parametric function $g_\theta : \mathcal{Z} \to \mathcal{X}$ to construct a probability distribution $P_\theta$. In this light, one can learn the probability distribution $P_r$ by adapting $\theta$ and fitting the data with $P_\theta$. This approximation is done by finding a solution $f$ that optimizes a given cost function between $P_r$ and $P_\theta$.

Despite the explicit formulas derived in the duality representation and the gradient evaluation (Theorem 3.6 and Corollary 3.6.1), it is infeasible to directly compute such an $f$ in practice. Nevertheless, since the Wasserstein divergence is parametrized by any convex function in RWGANs, it provides a great deal of flexibility in the choice of loss functions. For example, one can choose an appropriate $\phi$ such that

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] \approx \mathbb{E}_Z \left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right]. \]

In our experiment, we try the KL divergence, where $\nabla^2\phi(x) = \mathrm{diag}(1/x)$, since we observe that

\[ \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \mathbf{1} \right] \leq C, \]

where $\mathbf{1}$ is the all-ones vector and $C$ is a constant depending on the Lipschitz constant of $g_\theta$. This implies that this term is controlled by $\theta$ during the process of training. The numerical results confirm the effectiveness of our heuristic in practice.

Experimental framework. Our experimental framework is similar to the one in WGANs [2] in that a) we apply back-propagation to train the generator and discriminator networks, and b) we update the parameters once in the generative model and $n_{critic}$ times in the discriminator network. Our framework differs from WGANs [2] in several aspects. First, we use $\nabla\phi$ to do asymmetric clipping instead of symmetric clipping. Note that the asymmetric clipping guarantees the Lipschitz continuity of $f$ and $\nabla\phi(w) \in [-c, c]$. Second, we use a scaling parameter $S$ to stabilize the asymmetric clipping. This is critical for the experiment. Finally, we adopt RMSProp [34] instead of ADAM [19], which allows the choice of a larger step-size and avoids the non-stationary problem [28]. We describe our method with default parameters in Algorithm 1.

Algorithm 1 RWGANs. The default values are $\alpha = \ldots$, $c = 0.005$, $S = 0.01$, $m = 64$, $n_{critic} = 5$.

Require: $\alpha$: the learning rate; $c$: the clipping parameter; $m$: the batch size; $n_{critic}$: the number of iterations of the critic per generator iteration; $N_{max}$: the maximum number of epochs (one forward pass and one backward pass of all the training examples).
Require: $w_0$: initial critic parameters; $\theta_0$: initial generator parameters.

for $N = 1, 2, \ldots, N_{max}$ do
    for $t = 0, \ldots, n_{critic}$ do
        Sample a batch of real data $\{x_i\}_{i=1}^m$ from $P_r$.
        Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
        $g_w \leftarrow \frac{1}{m} \sum_{i=1}^m \left[ \nabla_w f_w(x_i) - \nabla_w f_w(g_\theta(z_i)) \right]$.
        $w \leftarrow w + \alpha \cdot \mathrm{RMSProp}(w, g_w)$.
        $w \leftarrow \mathrm{clip}\left( w, \ S \cdot (\nabla\phi)^{-1}(-c), \ S \cdot (\nabla\phi)^{-1}(c) \right)$.
    end for
    Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
    $g_\theta \leftarrow \frac{1}{m} \sum_{i=1}^m \nabla_\theta f_w(\nabla\phi(g_\theta(z_i)))$.
    $\theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta, g_\theta)$.
end for
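To make the clipping step of Algorithm 1 concrete, the following sketch computes the clipping interval for the KL case. It rests on assumptions that should be read as ours rather than the authors': we take $\phi(x) = x \log x$ component-wise, so that $\nabla\phi(x) = \log x + 1$ and $(\nabla\phi)^{-1}(y) = e^{y - 1}$, and we read the interval as $[S \cdot (\nabla\phi)^{-1}(-c), \ S \cdot (\nabla\phi)^{-1}(c)]$, which matches the statement that the clipping keeps $\nabla\phi(w)$ in $[-c, c]$ up to the scaling $S$. The helper names are hypothetical, and weights are plain Python floats rather than network tensors.

```python
import math

def grad_phi_inverse_kl(y):
    """Inverse of grad phi(x) = log(x) + 1 for phi(x) = x log x (assumed KL generator)."""
    return math.exp(y - 1.0)

def clip_bounds(c, scale, grad_phi_inverse):
    """Lower and upper clipping values S * (grad phi)^{-1}(-c) and S * (grad phi)^{-1}(c)."""
    return scale * grad_phi_inverse(-c), scale * grad_phi_inverse(c)

def clip_weights(weights, lower, upper):
    """Project every weight onto the interval [lower, upper]."""
    return [min(max(w, lower), upper) for w in weights]

# Default c and S from Algorithm 1; the weight values are made up for illustration.
lower, upper = clip_bounds(c=0.005, scale=0.01, grad_phi_inverse=grad_phi_inverse_kl)
clipped = clip_weights([-0.2, 0.00368, 0.5], lower, upper)
```

Unlike the symmetric interval $[-c, c]$ used in WGANs, this interval is not centered at zero, which is the sense in which the clipping is asymmetric under the reading above.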

Experimental setting. In order to test RWGANs, we adopt nine baseline methods as discussed in the introduction. The methods compared are RWGANs, WGANs [2], WGANs-GP [17], CGANs [27], InfoGANs [10], GANs [16], LSGANs [24], DRAGANs [20], BEGANs [5], EBGANs [38], and ACGANs [29]. The implementation of all these approaches is based on publicly available online information^1. In addition, we use the following four standard and well-known datasets in our experiment.

1. MNIST^2 is a dataset of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
2. Fashion-MNIST^3 is a dataset of Zalando's article images intended as an alternative to MNIST. It consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a gray-scale image, associated with a label from 10 classes.
3. The CIFAR-10^4 dataset consists of color images in 10 classes, with 6000 images per class. There are training images and test images.
4. The ImageNet^5 dataset is a large visual database designed for visual object recognition research. As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate which objects are in the picture. In at least one million of the images, bounding boxes are also provided.

Metric. The negative critic loss, a standard quantitative metric, is used in all our experiments. In addition to the negative critic loss, we use the inception score [33] to evaluate samples generated by the three WGAN methods on CIFAR-10 and ImageNet. The inception score is defined as follows:

\[ \text{Inception\_Score} = \exp\left\{ \mathbb{E}_x \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right\}, \]

where $p(y \mid x)$ is given by the inception network. A high inception score is an indicator that the images generated by the model are highly interpretable and diversified. It is also highly correlated with human evaluation of the images.
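The inception score formula can be illustrated on hand-made class probabilities. In the sketch below the rows of `cond_probs` stand in for the outputs $p(y \mid x)$ of the inception network, which is not run here, and the marginal $p(y)$ is estimated as the average of those rows; the function name and the toy inputs are our own assumptions.

```python
import math

def inception_score(cond_probs):
    """exp of the average KL(p(y|x) || p(y)) over a list of per-image class-probability rows."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal class distribution p(y), estimated by averaging p(y|x) over the images.
    marginal = [sum(row[j] for row in cond_probs) / n for j in range(k)]
    mean_kl = 0.0
    for row in cond_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        mean_kl += sum(p * math.log(p / marginal[j]) for j, p in enumerate(row) if p > 0.0) / n
    return math.exp(mean_kl)

# Uninformative predictions: every image gets the uniform distribution over 4 classes.
uniform_rows = [[0.25] * 4 for _ in range(8)]
# Confident and diverse predictions: one-hot rows that cover the 4 classes evenly.
one_hot_rows = [[1.0 if j == i % 4 else 0.0 for j in range(4)] for i in range(8)]

score_uniform = inception_score(uniform_rows)
score_one_hot = inception_score(one_hot_rows)
```

The two toy inputs bracket the range of the score for four classes: uniform predictions give the minimum value of 1, while confident predictions spread evenly over the classes give the maximum, equal to the number of classes.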
4.2 Experimental Result

Experiments on MNIST and Fashion-MNIST: We start our experiment by training models using the ten different GANs procedures on MNIST and Fashion-MNIST. The architecture is DCGAN [31] and the maximum number of epochs is 100. Figure 3 shows the training curves of the negative critic loss of all candidate approaches. The figure indicates that RWGANs and WGANs are stable with the smallest variances, where RWGANs have a slightly higher variance, partly due to the use of a larger step-size and asymmetric clipping. This slightly higher variance nevertheless speeds up training. Indeed, as illustrated in Figure 4 and Figure 5, RWGANs are the fastest to generate meaningful images. Note that CGANs and InfoGANs seem faster, but the images they generate are not meaningful, as they fall into bad local optima from an optimization perspective.

Experiments on CIFAR-10 and ImageNet: After observing that WGANs and RWGANs perform the best among all the variants of GANs, we proceed to compare RWGANs and WGANs, together with WGANs with Gradient Penalty (WGANs-GP), on two much larger datasets, CIFAR-10 and ImageNet. Here the architectures used are DCGAN and ReLU-MLP [11], and the maximum number of epochs is set to 25. Figure 6 again shows the training curves of the negative critic loss of all candidate approaches. Apart from the small variance of WGANs, we observe that, in terms of the negative critic loss, WGANs-GP tend to diverge as the training progresses, suggesting that this method might not be robust in practice despite its fast rate of training. In this case, RWGANs achieve relatively low variance with a convergent negative critic loss, offering a trade-off between robustness and efficiency.

We then evaluate the candidate methods with the inception score and present the results in Table 2. The table shows that RWGANs are often the fastest method. They perform the best in three out of four cases during the early epochs, and obtain images of competitively high quality at the final stage. Figures 7, 8, 9 and 10 show the sample qualities of the images generated at the initial and final stages, which support our conclusion.

[Table 2 layout: rows are RWGANs, WGANs and WGANs-GP under each of the DCGAN and MLP architectures; columns are CIFAR-10 (first 5 epochs, last 10 epochs) and ImageNet (first 3 epochs, last 5 epochs).]
Table 2: Inception scores at the beginning and final stages of training.
DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units at each layer.

5 Conclusion

We propose a novel class of statistical divergences called RW divergence and establish several important theoretical properties that are crucial for training GANs. The experiments on image generation, with RW parametrized by the KL divergence, show that RWGANs offer a promising trade-off between WGANs and WGANs-GP, achieving both robustness and efficiency during the learning process. The asymmetric clipping in RWGANs is a viable alternative to the gradient penalty and the symmetric clipping in WGANs, avoiding low-quality samples and failure of convergence. The flexible framework of RW divergences opens a door to the implementation and comparison of different loss functions for GANs. Meanwhile, it raises the natural question of whether one can select φ according to the statistics of the data and the structure of the problem.

References

[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML.
[3] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7).
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct).
[5] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint.
[6] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4).
[7] L. Caffarelli. Some regularity properties of solutions of Monge-Ampère equation. Communications on Pure and Applied Mathematics, 44(8-9).
[8] L. Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99–104.
[9] S. Chen and A. Figalli. Partial W^{2,p} regularity for optimal transport maps. Journal of Functional Analysis, 272(11).
[10] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS.
[11] B. Conan-Guez and F. Rossi. Multi-Layer Perceptrons for functional data analysis: a projection based approach. Artificial Neural Networks, ICANN 2002.
[12] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS.
[13] L. Evans and R. Gariepy. Measure Theory and Fine Properties of Functions. CRC Press.

[14] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4).
[15] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. AAAI.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS.
[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint.
[18] L. K. Jones and C. L. Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Transactions on Information Theory.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint.
[20] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint.
[21] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR.
[22] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint.
[23] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training.
[24] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint.
[25] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML.
[26] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2).
[27] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint.
[28] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML.
[29] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. ICML.
[30] M. C. Pardo and I. Vajda. On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory, 49(7).
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint.
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. ICML.
[33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. NIPS.
[34] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
[35] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
[36] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. NIPS.
[37] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint.
[38] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint.
[39] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV.

Figure 3: Training curves of the negative critic loss at different stages of training on MNIST and Fashion-MNIST, for RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs. G loss and D loss refer to the loss in the generative and discriminative nets, plotted as orange and blue lines, respectively.

Figure 4: Sample qualities at different stages of training (N = 1, 10, 25, 100) on MNIST, for RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.

Figure 5: Sample qualities at different stages of training (N = 1, 10, 25, 100) on Fashion-MNIST, for RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.

Figure 6: Training curves at different stages of training on CIFAR-10 and ImageNet, for RWGANs, WGANs and WGANs-GP with the DCGAN and MLP architectures. DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units at each layer. G loss and D loss refer to the loss in the generative and discriminative nets. The loss in RWGANs converges consistently, while the loss in WGANs-GP tends to diverge as the training progresses. WGANs achieve the lowest variance among the three methods.

Figure 7: Sample qualities at the initial stage of training (N = 1) on CIFAR-10, for RWGANs, WGANs and WGANs-GP with the DCGAN and MLP architectures.

Figure 8: Sample qualities at the final stage of training (N = 100) on CIFAR-10, for RWGANs, WGANs and WGANs-GP with the DCGAN and MLP architectures.

Figure 9: Sample qualities at the initial stage of training (N = 1) on ImageNet, for RWGANs, WGANs and WGANs-GP with the DCGAN and MLP architectures.

Figure 10: Sample qualities at the final stage of training (N = 25) on ImageNet, for RWGANs, WGANs and WGANs-GP with the DCGAN and MLP architectures.
