arXiv: v4 [stat.ML] 16 Sep 2018
Relaxed Wasserstein with Applications to GANs

Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang

September 9, 2018

Abstract

We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence generalizes Wasserstein divergence and is parametrized by a class of strictly convex and differentiable functions. We establish for RW divergence several probabilistic properties, which are critical for the success of Wasserstein divergence. In particular, we show that RW divergence is dominated by Total Variation (TV) and Wasserstein-$L^2$ divergence, and that RW divergence has continuity, differentiability and a duality representation. Finally, we provide a non-asymptotic moment estimate and a concentration inequality for RW divergence. Our experiments on image generation demonstrate that RW divergence is a suitable choice for GANs. The performance of RWGANs with Kullback-Leibler (KL) divergence is competitive with other state-of-the-art GAN approaches. Moreover, RWGANs possess better convergence properties than the existing WGANs with competitive inception scores. To the best of our knowledge, this new conceptual framework is the first to provide not only the flexibility in designing effective GAN schemes, but also the possibility of studying different loss functions under a unified mathematical framework.

1 Introduction

GANs. Generative Adversarial Networks (GANs) [16] provide a versatile class of models for generative modeling. Since their introduction to the machine learning community, the popularity of GANs has grown rapidly, with numerous applications. Examples include high resolution image generation [12, 31], image inpainting [37], image super-resolution [21], visual manipulation [39], text-to-image synthesis [32], video generation [36], semantic segmentation [23], and abstract reasoning diagram generation [15]. See also [1, 22, 25] for more details on the training dynamics of GANs.
Author affiliations (in author order):
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. xinguo@berkeley.edu.
Department of Statistics, University of California, Berkeley, USA. jcyhong@berkeley.edu.
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. darren_lin@berkeley.edu.
Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. nanyang@berkeley.edu.

The key idea behind GANs is to interpret the process of generative modeling as a competing game between two networks: a generator network and a discriminator network. The generator network attempts to fool
the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is a fake data sample or a true data sample.

There are many variants of GANs. Least square GANs (LSGANs) [24] attain a stable performance during the learning process by replacing the sigmoid cross entropy loss with the least square loss in the discriminator network of the original GANs. They are also shown to generate higher quality images than the original GANs in practice. DRAGANs [20] alleviate the instability of GAN training and offer a clear game-theoretic justification, by introducing regret minimization to reach the equilibrium in games and to further explain the reason for the success of simultaneous gradient descent in GANs. Conditional GANs (CGANs) [27] are the conditional version of GANs, proposing to stabilize the training by imposing control on the modes of the generated data in an original generative model. Information-theoretic GANs (InfoGANs) [10] are an information-theoretic extension of GANs, providing highly semantic and meaningful hidden representations on a number of image datasets by maximizing the mutual information between a fixed small subset of the GAN's noise variables and the observations. Auxiliary Classifier GANs (ACGANs) [29] improve GANs by adding more structure to the latent space together with a specialized cost function, producing high-quality samples. They also lead to a new analysis for assessing the discriminability and diversity of samples from class-conditional image synthesis models. Energy-Based GANs (EBGANs) [38] propose a new energy perspective of GANs. They construct an energy function to measure the discriminator that attributes lower energies to the regions near the data manifold and higher energies to other regions. As a result, the EBGAN framework is shown to generate reasonable high-resolution images without a multi-scale approach.
Boundary Equilibrium GANs (BEGANs) [5] adopt a new equilibrium enforcing method paired with the Wasserstein divergence to train GANs with an auto-encoder. This approach not only balances the generator network and the discriminator network but also uncovers a novel approximate convergence measure, leading to fast and stable training with high visual quality.

WGANs. A recurring theme in improving GAN training is the choice of loss functions. The first proposed class of loss functions is based on the Jensen-Shannon (JS) divergence, which is essentially the symmetric version of the Kullback-Leibler (KL) divergence. It is shown in [2] that the JS divergence is undesirable because it leads to unstable training, and the Wasserstein-$L^1$ divergence is suggested as an alternative. The resulting Wasserstein GANs (WGANs) outperform the original GANs in several aspects. The Wasserstein-$L^1$ divergence is continuous, differentiable and has a duality representation, allowing a very stable gradient flow in the process of training. Besides the stability, the Wasserstein-$L^1$ divergence also avoids the issue of mode collapse and further provides meaningful learning curves that can be used for debugging and for hyperparameter searching. With additional weight clipping [2] and gradient penalty [17], the volatility of the gradient is controlled to some extent.

Our work. We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence is Wasserstein divergence parametrized by a class of strictly convex and differentiable functions which contain different curvature information. Naturally, RW divergence provides more flexibility and possibilities in generative modeling. To ensure that RW divergence is a viable option for comparing probability distributions and is competitive with other Wasserstein divergences in generative modeling,
this paper addresses the following theoretical questions along with related computational issues. Does the gradient of RW divergence exist and allow for an explicit form? Does RW divergence enjoy the same mathematical properties as the standard Wasserstein divergence? Does RW divergence have a duality representation as Wasserstein divergence does?

In this paper, we first show that RW divergence is dominated by the total variation (TV) distance and the squared Wasserstein-$L^2$ divergence (Theorem 3.1). We then obtain its nonasymptotic moment estimate (Theorem 3.2), its concentration inequality (Theorem 3.3), and its duality representation (Theorem 3.6). For application purposes, we show the existence of the gradient of RW divergence by first establishing its continuity and differentiability (Theorem 3.5). These properties justify the gradient descent procedure and yield an explicit formula for the gradient evaluation together with an asymmetric clipping (Corollary 3.6.1). This asymmetric clipping is useful for controlling the volatility of the gradient.

Finally, we compare RWGANs with several state-of-the-art GANs in image generation. We use RWGANs with KL divergence and the architectures of DCGAN and MLP. We first evaluate all of the candidate methods on the MNIST and Fashion-MNIST datasets and show that RWGANs are competitive with other popular approaches. Then we conduct experiments on the CIFAR-10 and ImageNet datasets to investigate whether RWGANs outperform WGANs with symmetric clipping and with gradient penalty, denoted as WGANs and WGANs-GP respectively. Our numerical results suggest that RWGANs strike a balance between WGANs and WGANs-GP: WGANs-GP fail to converge although they can achieve the fastest rate of training, whereas RWGANs are very robust and converge faster than WGANs. Therefore, RWGANs are more desirable for large-scale computations.
Furthermore, RWGANs attain the highest inception scores at the initial stage of training on CIFAR-10; the inception score is known to correlate well with human evaluation of generated samples [33]. As a byproduct, our experiment provides some evidence that an appropriate weight clipping has the potential to be competitive with gradient penalty in WGANs.

Open question. Theoretically, this new conceptual framework provides a unified mathematical framework to implement and investigate different Wasserstein divergences. Such flexibility raises a natural question on whether the underlying convex function $\phi$ can be determined in advance based on data samples and problem structure. While the main focus of this paper is the application to GANs, we believe that the theoretical results on RW divergence can be a valuable addition to the rich theory of optimal transport, where regularities of Wasserstein-based cost functions have been extensively studied [7, 8, 9, 35].

Organization. The rest of the paper is organized as follows. Section 2 provides the preliminaries and notations that will be used throughout the paper. Section 3 describes the RW divergence and discusses its theoretical properties. In Sections 4.1 and 4.2, we discuss the implementation of the method and present two numerical studies on real data examples. Section 5 concludes our paper.

2 Background

In this section, we review the definitions and properties of Bregman divergence and Wasserstein divergence.
2.1 Notations

Throughout the paper, the following notations are used unless otherwise stated. We denote $x \in \mathbb{R}^d$ as a vector in Euclidean space and $X$ as a matrix. $x^\top$ denotes the transpose of a vector $x$ and $\log(x)$ denotes the component-wise logarithm of a vector $x$. $X \succeq 0$ or $X \succ 0$ means that $X$ is positive semi-definite or positive definite, respectively. $\mathcal{X} \subseteq \mathbb{R}^d$ denotes a set, where the diameter of $\mathcal{X}$ is defined as

\[ \mathrm{diam}(\mathcal{X}) = \max_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|_2, \]

and $1_{\mathcal{X}}$ denotes an indicator function of the set $\mathcal{X}$. We denote $P$ and $Q$ as two probability distributions, $\mathcal{P}(\mathcal{X})$ denotes the set of probability distributions defined on $\mathcal{X}$, and $\Pi(P, Q)$ denotes the set of all couplings of $P$ and $Q$, i.e., the set of all joint distributions over $\mathcal{X} \times \mathcal{X}$ with marginal distributions being $P$ and $Q$. We assume that $\phi$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, i.e., $0 \prec \nabla^2 \phi(x) \preceq L I_d$, where $x \in \mathrm{dom}(\phi)$, i.e., the domain of $\phi$, and $I_d$ is an identity matrix in $\mathbb{R}^{d \times d}$. For the statistical learning setup, we define $P_r$ as an unknown true probability distribution, $P_n$ as the empirical distribution based on $n$ observations from $P_r$, and $\{P_\theta : \theta \in \mathbb{R}^d\}$ as a parametric family of probability distributions.

2.2 Wasserstein Divergence

Definition 2.1. The Wasserstein divergence of order $p$ between the probability distributions $P$ and $Q$ is defined as

\[ W_p(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} [c(x, y)]^p \, \pi(dx, dy) \right)^{1/p}, \qquad (1) \]

where $p \ge 1$ and $c \ge 0$ is a metric supported on $\mathcal{X} \times \mathcal{X}$. An important special case is the Wasserstein-$L^q$ divergence of order $p$,

\[ W_p^{L^q}(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_q^p \, \pi(dx, dy) \right)^{1/p}. \qquad (2) \]

Letting $q = 2$, $p = 2$ and $\mathcal{X} = \mathbb{R}^d$ in (2), we obtain the Wasserstein-$L^2$ divergence of order 2, whose square is the squared Wasserstein-$L^2$ divergence used below:

\[ W_2^{L^2}(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int \|x\|_2^2 \, (P + Q)(dx) - \sup_{\pi \in \Pi(P,Q)} \int 2 x^\top y \, \pi(dx, dy) \right)^{1/2}. \]

Remark 1. Given $P$ and $Q$, we have the following two properties of the Wasserstein divergence of order $p$:

1. $W_p(P, Q) \ge 0$ and the equality holds if and only if $P = Q$ almost everywhere.
2. $W_p(P, Q)$ is a metric since $W_p(P, Q) = W_p(Q, P)$ and

\[ W_p(P, Q) \le W_p(P, S) + W_p(S, Q), \]

where $S$ is another probability distribution.

2.3 Bregman Divergence

Definition 2.2 ([18]). The Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$ is denoted as

\[ D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \qquad (3) \]

for any $x, y \in \mathbb{R}^d$, and

\[ D_\phi(P, Q) = \int \left[ \phi(p(x)) - \phi(q(x)) - \langle \nabla \phi(q(x)), p(x) - q(x) \rangle \right] dx \]

for two continuous probability distributions $P$ and $Q$, where $p = dP/d\mu$ and $q = dQ/d\mu$, given that $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure $\mu$.

Examples of the function $\phi$ and the resulting Bregman divergences are listed as follows:

- $L^2$ divergence: $D_\phi(x, y) = \|x - y\|_2^2$ where $\phi(x) = \|x\|_2^2$,
- Itakura-Saito divergence: $D_\phi(x, y) = \frac{x}{y} - \log\left(\frac{x}{y}\right) - 1$ where $\phi(x) = -\log x$,
- KL divergence: $D_\phi(x, y) = x \log\left(\frac{x}{y}\right)$ where $\phi(x) = x \log(x)$,
- Mahalanobis divergence: $D_\phi(x, y) = (x - y)^\top A (x - y)$ where $\phi(x) = x^\top A x$ and $A \succ 0$.

Remark 2.

1. $D_\phi(x, y) \ge 0$ due to the convexity of $\phi$, and the equality holds true if and only if $x = y$.
2. $D_\phi(x, y)$ is not a metric: it is not symmetric and it violates the triangle inequality.
3. Bregman divergences are asymptotically equivalent to f-divergences (in particular, $\chi^2$-divergence) under some conditions [30], and are the unique class of divergences where the conditional expectation is the optimal predictor [3].
4. In statistical learning, the Bregman divergence is extensively exploited for K-means clustering [4].

We also provide a lemma which will be used in our analysis.

Lemma 2.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then

\[ D_\phi(x, y) \le \frac{L}{2} \|x - y\|_2^2 \quad \text{for any } x, y \in \mathbb{R}^d. \]
Proof.

\[
\begin{aligned}
D_\phi(x, y) &= \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla \phi(tx + (1-t)y), x - y \rangle \, dt - \langle \nabla \phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla \phi(tx + (1-t)y) - \nabla \phi(y), x - y \rangle \, dt \\
&\le \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 = \frac{L}{2} \|x - y\|_2^2,
\end{aligned}
\]

where the second equality comes from the mean value theorem and the inequality comes from the fact that $\phi$ is a twice-differentiable function with an $L$-Lipschitz continuous gradient.

3 Relaxed Wasserstein Divergence

We now propose a new class of statistical divergences called Relaxed Wasserstein (RW) divergence, which can be seen as a combination of Bregman divergence and Wasserstein divergence. The term "relaxed" refers to the fact that RW divergence relaxes the symmetry of the cost function $c(x, y)$ in (1) and extends to a broader class of asymmetric divergences.

Definition 3.1. The Relaxed Wasserstein divergence between the probability distributions $P$ and $Q$ is defined as

\[ W_{D_\phi}(P, Q) = \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} D_\phi(x, y) \, \pi(dx, dy), \]

where $D_\phi$ is the Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$.

Remark 3.

1. $W_{D_\phi}(P, Q) \ge 0$ and the equality holds if and only if $P = Q$ almost everywhere.
2. $W_{D_\phi}(P, Q)$ is not a metric since $D_\phi(x, y)$ is asymmetric.
3. $W_{D_\phi}(P, Q)$ includes two important special cases, the squared $W_2^{L^2}$ and $W_{KL}$. More specifically, $W_{D_\phi} = [W_2^{L^2}]^2$ when $\phi(x) = \|x\|_2^2$, and $W_{D_\phi} = W_{KL}$ when $\phi(x) = x \log(x)$.

3.1 Probabilistic Properties

In this subsection, we establish several probabilistic properties of RW divergence. Recall that the Wasserstein divergence is controlled by the weighted Total Variation (TV) distance (see Theorem 6.15 in [35] for more details). In parallel, we show that the RW divergence is dominated by the weighted TV distance and the squared Wasserstein-$L^2$ divergence.

Definition 3.2. The Total Variation distance between the probability distributions $P$ and $Q$ is defined as

\[ TV(P, Q) := \sup_{A} |P(A) - Q(A)|, \qquad (4) \]

where $A$ is a Borel set.
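As a small numerical illustration of Definition 2.2 and Lemma 2.1, the sketch below evaluates the Bregman divergence for two of the generators listed in Section 2.3. It is only a sanity check under stated assumptions, not part of the paper's method: the helper names (`bregman`, `sq_norm`, `neg_entropy`) and the two test vectors are hypothetical, and the vectors are chosen to sum to one so that the Bregman divergence of $\phi(x) = \sum_i x_i \log x_i$ reduces to the usual KL expression $\sum_i x_i \log(x_i / y_i)$.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for vectors given as lists."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

# Generator phi(x) = ||x||_2^2, which yields the squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generator phi(x) = sum_i x_i log x_i (negative entropy), defined for x > 0.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

# Hypothetical probability vectors (both sum to one).
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_l2 = bregman(sq_norm, sq_norm_grad, x, y)
d_kl_xy = bregman(neg_entropy, neg_entropy_grad, x, y)
d_kl_yx = bregman(neg_entropy, neg_entropy_grad, y, x)
kl_direct = sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

# Lemma 2.1 for phi = ||x||^2: the Hessian is 2 I, so L = 2 and the bound is attained.
L = 2.0
lemma_bound = 0.5 * L * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

print("L2 divergence:", d_l2, "Lemma 2.1 bound:", lemma_bound)
print("KL(x, y):", d_kl_xy, "direct formula:", kl_direct, "KL(y, x):", d_kl_yx)
```

The two KL values differ, which reflects the asymmetry noted in Remark 2; this asymmetry is what the RW divergence inherits through its cost function.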
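Theorem 3.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then

\[ W_{D_\phi}(P, Q) \le L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q), \qquad W_{D_\phi}(P, Q) \le \frac{L}{2} \left[ W_2^{L^2}(P, Q) \right]^2, \]

where $P$ and $Q$ are two probability distributions supported on a compact set $\mathcal{X} \subseteq \mathbb{R}^d$.

Proof. For the first inequality, we define $\pi^*$ as the transfer plan that keeps all the mass shared by $P$ and $Q$ fixed and distributes the rest uniformly, i.e.,

\[ \pi^*(dx, dy) = (P \wedge Q)(dx) \, \delta_{\{y = x\}} + \frac{1}{a} (P - Q)_+(dx) \, (P - Q)_-(dy), \]

where $P \wedge Q = P - (P - Q)_+$ and $a = (P - Q)_+[\mathcal{X}] = (P - Q)_-[\mathcal{X}]$. Then

\[
\begin{aligned}
W_{D_\phi}(P, Q) &\le \int D_\phi(x, y) \, \pi^*(dx, dy) \\
&= \frac{1}{a} \iint \left[ \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&= \frac{1}{a} \iint \left[ \int_0^1 \langle \nabla \phi(tx + (1-t)y) - \nabla \phi(y), x - y \rangle \, dt \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le \frac{1}{a} \iint \left[ \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&= \frac{L}{2a} \iint \|x - y\|_2^2 \, (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le \frac{L}{2a} \, [\mathrm{diam}(\mathcal{X})]^2 \, (P - Q)_+[\mathcal{X}] \, (P - Q)_-[\mathcal{X}] \\
&= \frac{L}{2} \, [\mathrm{diam}(\mathcal{X})]^2 \, a \\
&\le L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q),
\end{aligned}
\]

where the first inequality comes from Definition 3.1, the first equality comes from Definition 2.2 and the definition of the specific $\pi^*$ (the part of $\pi^*$ on the diagonal contributes nothing since $D_\phi(x, x) = 0$), the second inequality follows as in the proof of Lemma 2.1, the third inequality uses $\|x - y\|_2 \le \mathrm{diam}(\mathcal{X})$ for $x, y \in \mathcal{X}$, and the last inequality is from Definition 3.2, since $a = (P - Q)_+[\mathcal{X}] \le \sup_A |P(A) - Q(A)| = TV(P, Q)$.

For the second inequality, we have

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P,Q)} \int D_\phi(x, y) \, \pi(dx, dy) \\
&\le 0.5 L \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \\
&= 0.5 L \left[ W_2^{L^2}(P, Q) \right]^2,
\end{aligned}
\]

where the inequality holds thanks to Lemma 2.1 and the fact that $\pi(dx, dy) \ge 0$ for any coupling $\pi \in \Pi(P, Q)$. This completes the proof.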
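The two bounds of Theorem 3.1 can be checked numerically on a toy example. The sketch below is an illustration under explicit assumptions rather than the paper's experiment: it uses two hypothetical one-dimensional samples on the compact set $[0.5, 2]$, the scalar generator $\phi(t) = t \log t$ (so $\phi''(t) = 1/t \le 2$ on that set and one may take $L = 2$), and uniform empirical measures with the same number of atoms. For such measures an optimal coupling can be taken to be a permutation, so brute force over permutations computes the transport cost exactly for small samples.

```python
import itertools
import math

def kl_bregman(x, y):
    """Scalar Bregman divergence of phi(t) = t log t: x log(x/y) - x + y."""
    return x * math.log(x / y) - x + y

def sq_dist(x, y):
    """Squared Euclidean cost, i.e. the Bregman divergence of phi(t) = t^2."""
    return (x - y) ** 2

def transport_cost(cost, xs, ys):
    """Optimal transport cost between uniform empirical measures on xs and ys.

    Both measures put weight 1/n on each atom, so an optimal coupling can be
    chosen as a permutation and brute force is exact for small n.
    """
    n = len(xs)
    return min(
        sum(cost(xs[i], ys[p[i]]) for i in range(n)) / n
        for p in itertools.permutations(range(n))
    )

# Hypothetical samples inside [0.5, 2.0]; the supports are disjoint.
xs = [0.5, 0.7, 0.9, 1.1]
ys = [1.2, 1.5, 1.8, 2.0]
L = 2.0            # upper bound on phi''(t) = 1/t over [0.5, 2.0]
diam = 2.0 - 0.5   # diameter of the compact set
tv = 1.0           # disjoint supports give total variation distance 1

rw_pq = transport_cost(kl_bregman, xs, ys)  # W_{D_phi}(P, Q)
rw_qp = transport_cost(kl_bregman, ys, xs)  # W_{D_phi}(Q, P)
w2_sq = transport_cost(sq_dist, xs, ys)     # squared Wasserstein-L2 divergence

print("RW(P, Q):", rw_pq, "RW(Q, P):", rw_qp)
print("bound via W2:", 0.5 * L * w2_sq, "bound via TV:", L * diam ** 2 * tv)
```

On this example both bounds hold, and the two orderings of the arguments give different values, consistent with RW divergence not being a metric.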
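Next, we establish another key probabilistic property of RW divergence, i.e., the nonasymptotic moment estimates and the concentration inequality. Our results follow from two theorems presented in [14] and Theorem 3.1. We assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, and we further define two statistics,

\[ M_q(P_r) = \int \|x\|_2^q \, P_r(dx), \qquad E_{\alpha,\gamma}(P_r) = \int \exp\left( \gamma \|x\|_2^\alpha \right) P_r(dx). \]

Theorem 3.2 (Nonasymptotic Moment Estimate). Assume that $M_q(P_r) < +\infty$ for some $q > 2$. Then there exists a constant $C(q, d) > 0$ such that, for $n \ge 1$,

\[
\mathbb{E}\left[ W_{D_\phi}(P_n, P_r) \right] \le \frac{C(q, d) \, L \, M_q^{2/q}(P_r)}{2} \cdot
\begin{cases}
n^{-1/2} + n^{-(q-2)/q}, & 1 \le d \le 3, \ q \ne 4, \\
n^{-1/2} \log(1 + n) + n^{-(q-2)/q}, & d = 4, \ q \ne 4, \\
n^{-2/d} + n^{-(q-2)/q}, & d \ge 5, \ q \ne d/(d-2).
\end{cases}
\]

Theorem 3.3 (Concentration Inequality). Assume one of the three following conditions holds:

\[ \alpha > 2, \ \gamma > 0, \ E_{\alpha,\gamma}(P_r) < \infty, \qquad (5) \]

or

\[ \alpha \in (0, 2), \ \gamma > 0, \ E_{\alpha,\gamma}(P_r) < \infty, \qquad (6) \]

or

\[ q > 4, \ M_q(P_r) < \infty. \qquad (7) \]

Then for $n \ge 1$ and $\epsilon > 0$,

\[ \mathrm{Prob}\left( W_{D_\phi}(P_n, P_r) \ge \epsilon \right) \le a(n, \epsilon) \, 1_{\{\epsilon \le L/2\}} + b(n, \epsilon), \]

where

\[
a(n, \epsilon) = C_1
\begin{cases}
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2} \right), & 1 \le d \le 3, \\
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2 \log^2\left( 2 + \frac{L}{2\epsilon} \right)} \right), & d = 4, \\
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{d/2} \right), & d \ge 5,
\end{cases}
\]

and

\[
b(n, \epsilon) = C_2
\begin{cases}
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, & \text{under condition (5)}, \\
\exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{(\alpha - \varepsilon)/2} \right) 1_{\{\epsilon \le L/2\}} + \exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, \ 0 < \varepsilon < \alpha, & \text{under condition (6)}, \\
n \left( \dfrac{2 n \epsilon}{L} \right)^{-(q - \varepsilon)/2}, \ 0 < \varepsilon < q, & \text{under condition (7)},
\end{cases}
\]

where $c$, $C_1$ and $C_2$ are constants depending on $q$ and $d$, and $\varepsilon$ denotes an auxiliary parameter distinct from the threshold $\epsilon$.

3.2 Continuity, Differentiability and Duality Representation

In this subsection, we establish the continuity, differentiability and duality representation of RW divergence, demonstrating that RW divergence is a reasonable choice for GANs. We first present a simple yet important lemma.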
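Lemma 3.4 (Decomposition of RW divergence). The RW divergence can be decomposed in terms of the distorted squared Wasserstein-$L^2$ divergence of order 2 with several additional residual terms independent of the choice of coupling $\pi$, i.e.,

\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla \phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] Q(dx). \]

The relationship is also presented in Figure 1.

Figure 1: The decomposition of $W_{D_\phi}$, where the solid arrow denotes the transformation $\nabla \phi$ from $Q$ to $Q \circ (\nabla \phi)^{-1}$ and the dashed arrows denote the divergences $W_{D_\phi}(P, Q)$ and $W_2^{L^2}(P, Q \circ (\nabla \phi)^{-1})$ between probability distributions.

Proof. First, we need to prove that the inverse of $\nabla \phi$ is well-defined. Recall that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function, so we have $\nabla^2 \phi(x) \succ 0$ for $x \in \mathcal{X}$. That is to say, the gradient mapping $\nabla \phi : \mathcal{X} \to \mathbb{R}^d$ has a positive-definite Jacobian matrix at each point. Applying the mean value theorem yields that $\nabla \phi$ is injective, so the inverse of $\nabla \phi$ exists on its image and is bijective. Denote it as $(\nabla \phi)^{-1} : \nabla \phi(\mathcal{X}) \to \mathcal{X}$; then $Q \circ (\nabla \phi)^{-1}$ is also a probability distribution. Thus

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P,Q)} \int \left[ \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle \right] \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \Big\{ \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla \phi(y)\|_2^2 - \langle \nabla \phi(y), x \rangle \right] \pi(dx, dy) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] \pi(dx, dy) + \int \left[ \langle \nabla \phi(y), y \rangle - \phi(y) - \frac{1}{2} \|\nabla \phi(y)\|_2^2 \right] \pi(dx, dy) \Big\} \\
&= \inf_{\pi \in \Pi(P,Q)} \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla \phi(y)\|_2^2 - \langle \nabla \phi(y), x \rangle \right] \pi(dx, dy) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] Q(dx).
\end{aligned}
\]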
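Furthermore, observe that

\[
\begin{aligned}
\left[ W_2^{L^2}\left( P, Q \circ (\nabla \phi)^{-1} \right) \right]^2 &= \inf_{\pi \in \Pi(P, Q \circ (\nabla \phi)^{-1})} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - \nabla \phi(y)\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P,Q)} \int \left[ \|x\|_2^2 + \|\nabla \phi(y)\|_2^2 - 2 \langle \nabla \phi(y), x \rangle \right] \pi(dx, dy).
\end{aligned}
\]

Therefore,

\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla \phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] Q(dx). \]

This completes the proof.

Now we are ready to present our main results on the continuity and differentiability of the parametrized RW divergence in generative modeling.

Definition 3.3 (Generative modeling). The procedure of generative modeling is to approximate the unknown probability distribution $P_r$ by constructing a class of suitable parametric probability distributions $P_\theta$. More specifically, we define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $P_Z$ and a family of parametric functions $g_\theta : \mathcal{Z} \to \mathcal{X}$. Then $P_\theta$ is defined as the probability distribution of $g_\theta(Z)$.

Theorem 3.5 (Continuity and Differentiability of RW divergence). We have the following two statements about RW divergence:

1. $W_{D_\phi}(P_r, P_\theta)$ is continuous in $\theta$ if $g_\theta$ is continuous in $\theta$.
2. $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$, i.e., for each given $(\theta_0, z_0)$, there exists a neighborhood $N$ such that

\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \le L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]

for any $(\theta, z) \in N$.

Proof. It follows from Lemma 3.4 that

\[ W_{D_\phi}(P_r, P_\theta) = T_1 + T_2, \]

where

\[
\begin{aligned}
T_1 &= \frac{1}{2} \left[ W_2^{L^2}\left( P_r, P_\theta \circ (\nabla \phi)^{-1} \right) \right]^2, \\
T_2 &= \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P_r(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] P_\theta(dx).
\end{aligned}
\]

We observe that $T_2$ is continuous and differentiable with respect to $\theta$ since $\phi$ is a twice differentiable function. Furthermore, since $(\nabla \phi)^{-1}$ is also continuous and differentiable, it suffices to show that $W_2^{L^2}(P_r, P_\theta)$ is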
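continuous in $\theta$ if $g_\theta$ is continuous in $\theta$, and differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$ for any $\theta$.

Given two vectors $\theta_0, \theta \in \mathbb{R}^d$, we define $\pi$ as the joint distribution of $(g_\theta(Z), g_{\theta_0}(Z))$ where $Z \sim P_Z$. Then

\[ W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2}, \]

where

\[ \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \to 0 \ \text{as} \ \theta \to \theta_0, \quad \forall z \in \mathcal{Z}, \]

since $g_\theta$ is continuous in $\theta$. Furthermore, $\|g_{\theta_1}(z) - g_{\theta_2}(z)\|_2^2$ is uniformly bounded on $\mathcal{Z}$ since $g_\theta(z) \in \mathcal{X}$ and $\mathcal{X}$ is a compact set. Therefore, applying the bounded convergence theorem yields

\[ \left| W_2^{L^2}(P_r, P_\theta) - W_2^{L^2}(P_r, P_{\theta_0}) \right| \le W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2} \to 0 \quad \text{as } \theta \to \theta_0, \]

where the first inequality comes from the triangle inequality.

Given a pair $(\theta_0, z_0)$, the local Lipschitz continuity of $g_\theta$ implies that there exists a neighborhood $N$ such that

\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \le L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]

for any $(\theta, z) \in N$. Then

\[ \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \le \int_{\mathcal{Z}} [L(\theta_0, z_0)]^2 \, \|\theta - \theta_0\|_2^2 \, P_Z(dz_0) = \|\theta - \theta_0\|_2^2 \, \mathbb{E}\left[ L(\theta_0, Z)^2 \right]. \]

Therefore,

\[ \left| W_2^{L^2}(P_r, P_\theta) - W_2^{L^2}(P_r, P_{\theta_0}) \right| \le W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \right)^{1/2} \le \|\theta - \theta_0\|_2 \, \mathbb{E}\left[ L(\theta_0, Z)^2 \right]^{1/2}, \]

which implies that $W_2^{L^2}(P_r, P_\theta)$ is locally Lipschitz. Applying Rademacher's theorem [13] yields that $W_2^{L^2}(P_r, P_\theta)$ is differentiable with respect to $\theta$ almost everywhere. This completes the proof.

Next, we present our results on the duality representation of RW divergence.

Theorem 3.6 (Duality Representation of RW divergence). Assume that two probability distributions $P$ and $Q$ satisfy

\[ \int \|x\|_2^2 \, (P + Q)(dx) < +\infty. \]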
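Then there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as

\[ W_{D_\phi}(P, Q) = \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla \phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla \phi(x)) \, Q(dx) \right), \]

where $f^*$ is the conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$.

Proof. Given two probability distributions $P$ and $Q$ that satisfy

\[ \int \|x\|_2^2 \, (P + Q)(dx) < +\infty, \]

it follows from Proposition 3.1 in [6] that there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the squared Wasserstein-$L^2$ divergence of order 2 has a duality representation, i.e.,

\[
\begin{aligned}
\left[ W_2^{L^2}(P, Q) \right]^2 &= \inf_{\pi \in \Pi(P,Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \int \|x\|_2^2 \, (P + Q)(dx) - 2 \left( \int f(x) \, P(dx) + \int f^*(x) \, Q(dx) \right),
\end{aligned}
\]

where $f^*$ is the convex conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$. Combining this with Lemma 3.4 yields

\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla \phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] Q(dx) \\
&= \frac{1}{2} \left( \int \|x\|_2^2 \, P(dx) + \int \|\nabla \phi(x)\|_2^2 \, Q(dx) \right) - \left( \int f(x) \, P(dx) + \int f^*(\nabla \phi(x)) \, Q(dx) \right) \\
&\qquad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla \phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla \phi(x)\|_2^2 \right] Q(dx) \\
&= \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla \phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla \phi(x)) \, Q(dx) \right).
\end{aligned}
\]

This completes the proof.

Finally, we show that Theorem 3.6 allows for an explicit formula for the gradient evaluation in generative modeling (Definition 3.3), providing the theoretical guarantee for RWGAN training.

Corollary 3.6.1 (Gradient Evaluation). Under the setting of generative modeling, we assume that $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$ and

\[ \int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty. \]

Then there exists a Lipschitz continuous solution $f : \mathcal{X} \to \mathbb{R}$ such that the gradient of the RW divergence has an explicit form, i.e.,

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2 \phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z \left[ \nabla_\theta f(\nabla \phi(g_\theta(Z))) \right]. \]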
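Proof. Under the conditions that $g_\theta$ is locally Lipschitz and $\int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty$, it follows from Theorem 3.5 and Theorem 3.6 that $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere and there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as

\[ W_{D_\phi}(P_r, P_\theta) = \int \phi(x) \, (P_r - P_\theta)(dx) + \int \langle \nabla \phi(x), x \rangle \, P_\theta(dx) - \left( \int f(x) \, P_r(dx) + \int f^*(\nabla \phi(x)) \, P_\theta(dx) \right). \]

By the envelope theorem [26], we obtain that

\[
\begin{aligned}
\nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] &= \nabla_\theta \left[ -\int \phi(x) \, P_\theta(dx) + \int \langle \nabla \phi(x), x \rangle \, P_\theta(dx) - \int f^*(\nabla \phi(x)) \, P_\theta(dx) \right] \\
&= \nabla_\theta \left[ -\int_{\mathcal{Z}} \phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} \langle \nabla \phi(g_\theta(z)), g_\theta(z) \rangle \, P_Z(dz) - \int_{\mathcal{Z}} f^*(\nabla \phi(g_\theta(z))) \, P_Z(dz) \right] \\
&= -\int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla \phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla \phi(g_\theta(z)) \, P_Z(dz) \\
&\qquad + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2 \phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla \phi(g_\theta(z))) \, P_Z(dz) \\
&= \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2 \phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla \phi(g_\theta(z))) \, P_Z(dz).
\end{aligned}
\]

Relabeling $-f^*$ as $f$, we conclude that

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2 \phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z \left[ \nabla_\theta f(\nabla \phi(g_\theta(Z))) \right], \]

where $f$ is Lipschitz continuous. This completes the proof.

4 Empirical Results

In this section, we present a numerical evaluation on image generation to demonstrate the effectiveness and efficiency of using RW in GANs. We first describe our approach of RW in GANs (RWGANs) and the experimental setting (Section 4.1), and then report the experimental results under RWGANs and nine other well-established variants of GANs (Section 4.2).

4.1 Experimental Approach and Setting

RWGANs approach. The goal of GANs is to estimate a probability distribution $P_r$ hidden in the data. As in Definition 3.3, one can define a random variable $Z$ with a fixed distribution $P_Z$ and pass it through a parametric function $g_\theta : \mathcal{Z} \to \mathcal{X}$ to construct a probability distribution $P_\theta$. In this light, one can learn the probability distribution $P_r$ by adapting $\theta$ and fitting the data with $P_\theta$. This approximation is done by finding a solution $f$ that optimizes a given cost function between $P_r$ and $P_\theta$.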
Despite the explicit theoretical formulas derived in the duality representation and the gradient evaluation (Theorem 3.6 and Corollary 3.6.1), it is infeasible to directly compute such an $f$ in practice. Nevertheless, since the Wasserstein divergence is parametrized by any convex function in RWGANs, it provides a great deal of flexibility in the choice of loss functions. For example, one can choose an appropriate $\phi$ such that

\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] \approx \mathbb{E}_Z \left[ \nabla_\theta f(\nabla \phi(g_\theta(Z))) \right]. \]

In our experiment, we try the KL divergence, where $\nabla^2 \phi(x) = \mathrm{diag}(1/x)$, since we observe that

\[ \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2 \phi(g_\theta(Z)) \, g_\theta(Z) \right] = \mathbb{E}_Z \left[ [\nabla_\theta g_\theta(Z)]^\top \mathbf{1} \right] \le C, \]

where $\mathbf{1}$ is the all-ones vector and $C$ is a constant depending on the Lipschitz constant of $g_\theta$. This implies that this term is controlled by $\theta$ during the process of training. The numerical results confirm the effectiveness of our heuristic in practice.

Experimental framework. Our experimental framework is similar to the one in WGANs [2] in that a) we apply back-propagation to train the generator and discriminator networks, and b) we update the parameters once in the generative model and $n_{\mathrm{critic}}$ times in the discriminator network.

Our framework differs from WGANs [2] in several aspects. First, we use $\nabla \phi$ to do the asymmetric clipping instead of the symmetric clipping. Note that the asymmetric clipping guarantees the Lipschitz continuity of $f$ and $\nabla \phi(w) \in [-c, c]$. Second, we use a scaling parameter $S$ to stabilize the asymmetric clipping. This is critical for the experiment. Finally, we adopt RMSProp [34] instead of ADAM [19], which allows a choice of a larger step-size and avoids the non-stationary problem [28]. We describe our method with default parameters in Algorithm 1.

Algorithm 1 RWGANs. The default values are $\alpha$ = [value missing], $c = 0.005$, $S = 0.01$, $m = 64$, $n_{\mathrm{critic}} = 5$.

Require: $\alpha$: the learning rate; $c$: the clipping parameter; $S$: the scaling parameter; $m$: the batch size; $n_{\mathrm{critic}}$: the number of iterations of the critic per generator iteration; $N_{\max}$: the maximum number of epochs, where one epoch is one forward pass and one backward pass of all the training examples.
Require: $w_0$: initial critic parameters; $\theta_0$: initial generator parameters.

  for $N = 1, 2, \ldots, N_{\max}$ do
    for $t = 0, \ldots, n_{\mathrm{critic}}$ do
      Sample a batch of real data $\{x_i\}_{i=1}^m$ from $P_r$.
      Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
      $g_w \leftarrow \frac{1}{m} \sum_{i=1}^m \left[ \nabla_w f_w(x_i) - \nabla_w f_w(g_\theta(z_i)) \right]$.
      $w \leftarrow w + \alpha \cdot \mathrm{RMSProp}(w, g_w)$.
      $w \leftarrow \mathrm{clip}\left( w, \ -S \cdot (\nabla \phi)^{-1}(-c), \ S \cdot (\nabla \phi)^{-1}(c) \right)$.
    end for
    Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
    $g_\theta \leftarrow -\frac{1}{m} \sum_{i=1}^m \nabla_\theta f_w(\nabla \phi(g_\theta(z_i)))$.
    $\theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta, g_\theta)$.
  end for
Experimental setting. In order to test RWGANs, we adopt nine baseline methods as discussed in the introduction. The methods compared are RWGANs, WGANs [2], WGANs-GP [17], CGANs [27], InfoGANs [10], GANs [16], LSGANs [24], DRAGANs [20], BEGANs [5], EBGANs [38], and ACGANs [29]. The implementation of all these approaches is based on publicly available online information. In addition, we use the following four standard and well-known datasets in our experiment.

1. MNIST is a dataset of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
2. Fashion-MNIST is an alternative dataset to MNIST consisting of Zalando's article images. It consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a gray-scale image, associated with a label from 10 classes.
3. The CIFAR-10 dataset consists of color images in 10 classes, with 6000 images per class. The images are split into a training set and a test set.
4. The ImageNet dataset is a large visual database designed for visual object recognition research. As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate which objects are in the picture. In at least one million of the images, bounding boxes are also provided.

Metric. The negative critic loss, well-known as the standard quantitative metric, is used in all our experiments. In addition to the negative critic loss, we use the inception score [33] to evaluate samples generated by the three WGANs methods on CIFAR-10 and ImageNet. The inception score is defined as follows:

\[ \mathrm{Inception\_Score} = \exp\left\{ \mathbb{E}_x \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right\}, \]

where $p(y \mid x)$ is given by the inception network and $p(y)$ is the corresponding marginal over the generated samples. A high inception score is an indicator that the images generated by the model are highly interpretable and diversified. It is also highly correlated with human evaluation of the images.
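To make the inception score formula concrete, the following sketch computes it from a small table of class probabilities. It is an illustration only: in practice the rows $p(y \mid x)$ come from the inception network applied to generated images, whereas here they are hypothetical hand-written values, and the helper name `inception_score` is likewise an assumption.

```python
import math

def inception_score(cond_probs):
    """exp(E_x[KL(p(y|x) || p(y))]) for a list of rows p(y|x), one row per sample."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal class distribution p(y), averaged over the samples.
    marginal = [sum(row[j] for row in cond_probs) / n for j in range(k)]
    mean_kl = 0.0
    for row in cond_probs:
        # Terms with p(y|x) = 0 contribute zero to the KL divergence.
        kl = sum(p * math.log(p / marginal[j]) for j, p in enumerate(row) if p > 0.0)
        mean_kl += kl / n
    return math.exp(mean_kl)

# Confident and diverse predictions: each sample is assigned to a different class.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Uninformative predictions: every sample looks equally like every class.
uninformative = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]

score_high = inception_score(confident_diverse)
score_low = inception_score(uninformative)
print(score_high, score_low)
```

With three classes the score ranges between 1 (uninformative or collapsed predictions) and 3 (confident predictions spread evenly over the classes), which matches the interpretation of a high score as a sign of both interpretability and diversity.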
4.2 Experimental Results

Experiments on MNIST and Fashion-MNIST: We start our experiment by training models using the ten different GAN procedures on MNIST and Fashion-MNIST. The architecture is DCGAN [31] and the maximum number of epochs is 100. Figure 3 shows the training curves of the negative critic loss of all candidate approaches. The figure indicates that RWGANs and WGANs are stable with the smallest variances, where RWGANs have a slightly
higher variance partly due to the use of a larger step-size and asymmetric clipping. This slightly higher variance, nevertheless, speeds up the rate of training. Indeed, as illustrated in Figure 4 and Figure 5, RWGANs are the fastest to generate meaningful images. Note that CGANs and InfoGANs seem faster but the images they have generated are not meaningful, as they fall into bad local optima from an optimization perspective.

Experiments on CIFAR-10 and ImageNet: After observing that WGANs and RWGANs perform the best among all the variants of GANs, we proceed to compare RWGANs and WGANs, together with WGANs with Gradient Penalty (WGANs-GP), on two much larger datasets, CIFAR-10 and ImageNet. Here the architectures used are DCGAN and ReLU-MLP [11] and the maximum number of epochs is set to 25. Figure 6 shows the training curves of the negative critic loss of all candidate approaches again. Except for the small variance of WGANs, we observe that, in terms of the negative loss, WGANs-GP tend to diverge as the training progresses, implying that this method might not be robust in practice despite its fast rate of training. In this case, RWGANs achieve relatively low variance with convergent negative critic loss, leading to a trade-off between robustness and efficiency.

We then evaluate the candidate methods with the inception score and present the results in Table 2. The table shows that RWGANs are often the fastest method. They perform the best in three out of four cases during several early epochs, and obtain images with competitively high quality at the final stage. Figures 7, 8, 9 and 10 show the sample qualities of the images generated at the initial and final stages, which strongly supports our conclusion.

Table layout (the numerical entries are not preserved in this transcription): rows are the methods RWGANs, WGANs and WGANs-GP under each of the two architectures, DCGAN and MLP; columns are CIFAR-10 (first 5 epochs, last 10 epochs) and ImageNet (first 3 epochs, last 5 epochs).

Figure 2: Inception scores at the beginning and final stages of training.
In the inception-score table, DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units in each layer.

5 Conclusion

We propose a novel class of statistical divergences called RW divergence and establish several important theoretical properties that are crucial for GAN training. The experiments on image generation, with RW parametrized by the KL divergence, show that RWGANs are a promising trade-off between WGANs and WGANs-GP, achieving both robustness and efficiency during the learning process. The asymmetric clipping in RWGANs is a viable alternative to the gradient penalty and the symmetric clipping in WGANs, avoiding low-quality samples and failure of convergence. The flexible framework of RW divergences opens the door to the implementation and comparison of different loss functions for GANs. It also raises a natural question: can one select φ according to the statistics of the data and the structure of the problem?

References

[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML.
[3] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7).
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct).
[5] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint.
[6] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4).
[7] L. Caffarelli. Some regularity properties of solutions of Monge-Ampère equation. Communications on Pure and Applied Mathematics, 44(8-9).
[8] L. Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99-104.
[9] S. Chen and A. Figalli. Partial W^{2,p} regularity for optimal transport maps. Journal of Functional Analysis, 272(11).
[10] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS.
[11] B. Conan-Guez and F. Rossi. Multi-Layer Perceptrons for functional data analysis: a projection based approach. Artificial Neural Networks, ICANN 2002.
[12] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS.
[13] L. Evans and R. Gariepy. Measure Theory and Fine Properties of Functions. CRC Press.
[14] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4).
[15] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. AAAI.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS.
[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint.
[18] L. K. Jones and C. L. Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Transactions on Information Theory.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint.
[20] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint.
[21] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR.
[22] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint.
[23] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training.
[24] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint.
[25] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML.
[26] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2).
[27] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint.
[28] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML.
[29] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. ICML.
[30] M. C. Pardo and I. Vajda. On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory, 49(7).
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint.
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. ICML.
[33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. NIPS.
[34] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
[35] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
[36] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. NIPS.
[37] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint.
[38] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint.
[39] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV.
[Figure 3 panels: training curves on MNIST and Fashion-MNIST for each of RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.]
Figure 3: Training curves of the negative critic loss at different stages of training on MNIST and Fashion-MNIST. G loss and D loss refer to the losses of the generative and discriminative nets, plotted in orange and blue lines, respectively.
[Figure 4 panels: samples at N = 1, N = 10, N = 25 and N = 100 for each of RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.]
Figure 4: Sample qualities at different stages of training on MNIST.
[Figure 5 panels: samples at N = 1, N = 10, N = 25 and N = 100 for each of RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.]
Figure 5: Sample qualities at different stages of training on Fashion-MNIST.
[Figure 6 panels: training curves on CIFAR-10 and ImageNet for RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures.]
Figure 6: Training curves at different stages of training. DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units in each layer. G loss and D loss refer to the losses of the generative and discriminative nets. The loss in RWGANs converges consistently, while the loss in WGANs-GP tends to diverge as training progresses. WGANs achieve the lowest variance among the three methods.
[Figure 7 panels: samples at N = 1 for RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures.]
Figure 7: Sample qualities at the initial stage of training on CIFAR-10.
[Figure 8 panels: samples at N = 100 for RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures.]
Figure 8: Sample qualities at the final stage of training on CIFAR-10.
[Figure 9 panels: samples at N = 1 for RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures.]
Figure 9: Sample qualities at the initial stage of training on ImageNet.
[Figure 10 panels: samples at N = 25 for RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures.]
Figure 10: Sample qualities at the final stage of training on ImageNet.
More informationarxiv: v3 [stat.ml] 6 Dec 2017
Wasserstein GAN arxiv:1701.07875v3 [stat.ml] 6 Dec 2017 Martin Arjovsky 1, Soumith Chintala 2, and Léon Bottou 1,2 1 Introduction 1 Courant Institute of Mathematical Sciences 2 Facebook AI Research The
More informationDeep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation
Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks
More informationSum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017
Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth
More informationDeep Belief Networks are compact universal approximators
1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation
More informationarxiv: v1 [cs.lg] 12 Sep 2017
Dual Discriminator Generative Adversarial Nets Tu Dinh Nguyen, Trung Le, Hung Vu, Dinh Phung Centre for Pattern Recognition and Data Analytics Deakin University, Australia {tu.nguyen,trung.l,hungv,dinh.phung}@deakin.edu.au
More informationarxiv: v3 [cs.lg] 18 Jul 2016
Adversarial Feature Learning Jeff Donahue, Philipp Krähenbühl, Trevor Darrell Computer Science Division University of California, Berkeley {jdonahue,philkr,trevor}@cs.berkeley.edu arxiv:1605.09782v3 [cs.lg
More information