arxiv: v4 [stat.ml] 16 Sep 2018

Relaxed Wasserstein with Applications to GANs

Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang

September 9, 2018

Xin Guo: Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. xinguo@berkeley.edu.
Johnny Hong: Department of Statistics, University of California, Berkeley, USA. jcyhong@berkeley.edu.
Tianyi Lin: Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. darren_lin@berkeley.edu.
Nan Yang: Department of Industrial Engineering and Operations Research, University of California, Berkeley, USA. nanyang@berkeley.edu.

Abstract

We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence generalizes Wasserstein divergence and is parametrized by a class of strictly convex and differentiable functions. We establish for RW divergence several probabilistic properties that are critical for the success of Wasserstein divergence. In particular, we show that RW divergence is dominated by the Total Variation (TV) distance and the Wasserstein-$L^2$ divergence, and that RW divergence has continuity, differentiability and a duality representation. Finally, we provide a non-asymptotic moment estimate and a concentration inequality for RW divergence. Our experiments on image generation demonstrate that RW divergence is a suitable choice for GANs. The performance of RWGANs with Kullback-Leibler (KL) divergence is competitive with other state-of-the-art GAN approaches. Moreover, RWGANs possess better convergence properties than the existing WGANs, with competitive inception scores. To the best of our knowledge, this new conceptual framework is the first to provide not only flexibility in designing effective GAN schemes, but also the possibility of studying different loss functions under a unified mathematical framework.

1 Introduction

GANs. Generative Adversarial Networks (GANs) [16] provide a versatile class of models for generative modeling. Since their introduction to the machine learning community, the popularity of GANs has grown exponentially, with numerous applications. Examples include high-resolution image generation [12, 31], image inpainting [37], image super-resolution [21], visual manipulation [39], text-to-image synthesis [32], video generation [36], semantic segmentation [23], and abstract reasoning diagram generation [15]. See also [1, 22, 25] for more details on the training dynamics of GANs.

The key idea behind GANs is to interpret the process of generative modeling as a competing game between two networks: a generator network and a discriminator network. The generator network attempts to fool the discriminator network by converting random noise into sample data, while the discriminator network tries to identify whether the input sample is a fake or a true data sample.

There are many variants of GANs. Least Squares GANs (LSGANs) [24] attain stable performance during the learning process by replacing the sigmoid cross-entropy loss with the least squares loss in the discriminator network of the original GANs. They are also shown to generate higher-quality images than the original GANs in practice. DRAGANs [20] alleviate the instability of GAN training and offer a clear game-theoretic justification, introducing regret minimization to reach the equilibrium in games and to further explain the success of simultaneous gradient descent in GANs. Conditional GANs (CGANs) [27] are the conditional version of GANs; they stabilize the training by imposing control on the modes of the generated data in the original generative model. Information-theoretic GANs (InfoGANs) [10] are an information-theoretic extension of GANs that provides highly semantic and meaningful hidden representations on a number of image datasets by maximizing the mutual information between a fixed small subset of the GAN's noise variables and the observations. Auxiliary Classifier GANs (ACGANs) [29] improve GANs by adding more structure to the latent space together with a specialized cost function, yielding high-quality samples. They also lead to a new analysis for assessing the discriminability and diversity of samples from class-conditional image synthesis models. Energy-Based GANs (EBGANs) [38] propose a new energy perspective on GANs. They construct an energy function for the discriminator that attributes lower energies to the regions near the data manifold and higher energies to other regions. As a result, the EBGAN framework is shown to generate reasonable high-resolution images without a multi-scale approach.
Boundary Equilibrium GANs (BEGANs) [5] adopt a new equilibrium-enforcing method paired with the Wasserstein divergence to train GANs with an auto-encoder. This approach not only balances the generator network and the discriminator network but also uncovers a novel approximate convergence measure, leading to fast and stable training with high visual quality.

WGANs. A recurring theme in improving GAN training is the choice of loss function. The first proposed class of loss functions is based on the Jensen-Shannon (JS) divergence, which is essentially the symmetric version of the Kullback-Leibler (KL) divergence. It is shown in [2] that the JS divergence leads to unstable training, and the Wasserstein-$L^1$ divergence is suggested as an alternative. The resulting Wasserstein GANs (WGANs) outperform the original GANs in several aspects. The Wasserstein-$L^1$ divergence is continuous, differentiable and has a duality representation, allowing a very stable gradient flow during training. Besides stability, the Wasserstein-$L^1$ divergence also avoids the issue of mode collapse and provides meaningful learning curves that can be used for debugging and for hyperparameter search. With additional weight clipping [2] or gradient penalty [17], the volatility of the gradient is controlled to some extent.

Our work. We propose a novel class of statistical divergences called Relaxed Wasserstein (RW) divergence. RW divergence is Wasserstein divergence parametrized by a class of strictly convex and differentiable functions that carry different curvature information. Naturally, RW divergence provides more flexibility and more possibilities in generative modeling. To ensure that RW divergence is a viable option for comparing probability distributions and is competitive with other Wasserstein divergences in generative modeling, this paper addresses the following theoretical questions along with related computational issues.

1. Does the gradient of RW divergence exist and allow for an explicit form?
2. Does RW divergence enjoy the same mathematical properties as the standard Wasserstein divergence?
3. Does RW divergence have a duality representation, as Wasserstein divergence does?

In this paper, we first show that RW divergence is dominated by the total variation (TV) distance and the squared Wasserstein-$L^2$ divergence (Theorem 3.1). We then obtain its nonasymptotic moment estimate (Theorem 3.2), its concentration inequality (Theorem 3.3), and its duality representation (Theorem 3.6). For application purposes, we show the existence of the gradient of RW divergence by first establishing its continuity and differentiability (Theorem 3.5). These properties justify the gradient descent procedure, with an explicit formula for the gradient evaluation and an asymmetric clipping (Corollary 3.6.1). This asymmetric clipping is useful for controlling the volatility of the gradient.

Finally, we compare RWGANs with several state-of-the-art GANs in image generation. We use RWGANs with KL divergence and the architectures of DCGAN and MLP. We first evaluate all candidate methods on the MNIST and Fashion-MNIST datasets and show that RWGANs are competitive with other popular approaches. Then we conduct experiments on the CIFAR-10 and ImageNet datasets to investigate whether RWGANs outperform WGANs with symmetric clipping and with gradient penalty, denoted as WGANs and WGANs-GP respectively. Our numerical results suggest that RWGANs strike a balance between WGANs and WGANs-GP: WGANs-GP fail to converge although they achieve the fastest rate of training, whereas RWGANs are very robust and converge faster than WGANs. Therefore, RWGANs are more desirable for large-scale computations.
Furthermore, RWGANs attain the highest inception scores at the initial stage of training on CIFAR-10; the inception score is known to correlate well with human evaluation [33]. As a byproduct, our experiments provide some evidence that an appropriate weight clipping has the potential to be competitive with gradient penalty in WGANs.

Open question. Theoretically, this new conceptual framework provides a unified mathematical framework in which to implement and investigate different Wasserstein divergences. Such flexibility raises a natural question: can the underlying convex function $\phi$ be determined in advance based on the data samples and the problem structure? While the main focus of this paper is the application to GANs, we believe that the theoretical results on RW divergence can be a valuable addition to the rich theory of optimal transport, where regularities of Wasserstein-based cost functions have been extensively studied [7, 8, 9, 35].

Organization. The rest of the paper is organized as follows. Section 2 provides the preliminaries and notations that will be used throughout the paper. Section 3 describes the RW divergence and discusses its theoretical properties. In Sections 4.1 and 4.2, we discuss the implementation of the method and present two numerical studies on real data examples. Section 5 concludes the paper.

2 Background

In this section, we review the definitions and properties of Bregman divergence and Wasserstein divergence.

2.1 Notations

Throughout the paper, the following notations are used unless otherwise stated. We denote $x \in \mathbb{R}^d$ as a vector in Euclidean space and $X$ as a matrix. $x^\top$ denotes the transpose of a vector $x$, and $\log(x)$ denotes the component-wise logarithm of a vector $x$. $X \succeq 0$ or $X \succ 0$ means that $X$ is positive semi-definite or positive definite, respectively. $\mathcal{X} \subseteq \mathbb{R}^d$ denotes a set whose diameter is defined as
\[ \mathrm{diam}(\mathcal{X}) = \max_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|_2, \]
and $1_{\mathcal{X}}$ denotes the indicator function of the set $\mathcal{X}$. We denote $P$ and $Q$ as two probability distributions, $\mathcal{P}(\mathcal{X})$ denotes the set of probability distributions defined on $\mathcal{X}$, and $\Pi(P, Q)$ denotes the set of all couplings of $P$ and $Q$, i.e., the set of all joint distributions over $\mathcal{X} \times \mathcal{X}$ with marginal distributions $P$ and $Q$.

We assume that $\phi$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, i.e., $0 \prec \nabla^2 \phi(x) \preceq L I_d$ for $x \in \mathrm{dom}(\phi)$, where $\mathrm{dom}(\phi)$ is the domain of $\phi$ and $I_d$ is the identity matrix in $\mathbb{R}^{d \times d}$.

For the statistical learning setup, we define $P_r$ as an unknown true probability distribution, $P_n$ as the empirical distribution based on $n$ observations from $P_r$, and $\{P_\theta : \theta \in \mathbb{R}^d\}$ as a parametric family of probability distributions.

2.2 Wasserstein Divergence

Definition 2.1. The Wasserstein divergence of order $p$ between the probability distributions $P$ and $Q$ is defined as
\[ W_p(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} [c(x, y)]^p \, \pi(dx, dy) \right)^{1/p}, \tag{1} \]
where $p \ge 1$ and $c \ge 0$ is a metric supported on $\mathcal{X} \times \mathcal{X}$.

An important special case is the Wasserstein-$L^q$ divergence of order $p$:
\[ W_p^{L^q}(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_q^p \, \pi(dx, dy) \right)^{1/p}. \tag{2} \]
Letting $p = q = 2$ and $\mathcal{X} = \mathbb{R}^d$ in (2), we obtain the Wasserstein-$L^2$ divergence of order 2, whose square is referred to below as the squared Wasserstein-$L^2$ divergence:
\[ W_2^{L^2}(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int \|x\|_2^2 \, (P + Q)(dx) - \sup_{\pi \in \Pi(P, Q)} \int 2 x^\top y \, \pi(dx, dy) \right)^{1/2}. \]

Remark 1. Given $P$ and $Q$, the Wasserstein divergence of order $p$ has the following two properties.
1. $W_p(P, Q) \ge 0$, and equality holds if and only if $P = Q$ almost everywhere.
2. $W_p(P, Q)$ is a metric, since $W_p(P, Q) = W_p(Q, P)$ and
\[ W_p(P, Q) \le W_p(P, S) + W_p(S, Q), \]
where $S$ is another probability distribution.

2.3 Bregman Divergence

Definition 2.2 ([18]). The Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$ is denoted as
\[ D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle \tag{3} \]
for any $x, y \in \mathbb{R}^d$, and
\[ D_\phi(P, Q) = \int \left[ \phi(p(x)) - \phi(q(x)) - \langle \nabla\phi(q(x)), p(x) - q(x) \rangle \right] dx \]
for two continuous probability distributions $P$ and $Q$, where $p = dP/d\mu$ and $q = dQ/d\mu$, given that $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure $\mu$.

Examples of the function $\phi$ and the resulting Bregman divergences are listed as follows.
- $L^2$ divergence: $D_\phi(x, y) = \|x - y\|_2^2$, where $\phi(x) = \|x\|_2^2$.
- Itakura-Saito divergence: $D_\phi(x, y) = x/y - \log(x/y) - 1$, where $\phi(x) = -\log x$.
- KL divergence: $D_\phi(x, y) = x^\top \log(x/y)$, where $\phi(x) = x^\top \log(x)$ (for $x$ and $y$ on the probability simplex, so that the linear terms cancel).
- Mahalanobis divergence: $D_\phi(x, y) = (x - y)^\top A (x - y)$, where $\phi(x) = x^\top A x$ and $A \succ 0$.

Remark 2.
1. $D_\phi(x, y) \ge 0$ due to the convexity of $\phi$, and equality holds if and only if $x = y$.
2. $D_\phi(x, y)$ is not a metric in general: it need not be symmetric and it may violate the triangle inequality.
3. Bregman divergences are asymptotically equivalent to $f$-divergences (in particular, the $\chi^2$-divergence) under some conditions [30], and are the unique class of divergences for which the conditional expectation is the optimal predictor [3].
4. In statistical learning, the Bregman divergence is extensively exploited for K-means clustering [4].

We also provide a lemma that will be used in our analysis.

Lemma 2.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then
\[ D_\phi(x, y) \le \frac{L}{2} \|x - y\|_2^2 \]
for any $x, y \in \mathbb{R}^d$.
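Before the proof, Definition 2.2 and Lemma 2.1 can be sanity-checked numerically. The following Python sketch is only an illustration under stated assumptions: the helper names (`bregman`, `sq_norm`, `neg_entropy`, `softplus`) and the test vectors are hypothetical and not taken from the paper, and softplus is chosen merely because its second derivative is bounded by $1/4$, so it satisfies the hypothesis of Lemma 2.1 with $L = 1/4$.

```python
import math


def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for vectors x, y."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# phi(x) = ||x||_2^2 yields the squared Euclidean distance.
def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2.0 * v for v in x]


# phi(x) = sum_i x_i log x_i yields the KL divergence on the probability simplex.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


# Illustrative smooth phi (an assumption, not from the paper):
# coordinate-wise softplus, whose Hessian is bounded by 1/4, so L = 1/4.
def softplus(x):
    return sum(math.log1p(math.exp(v)) for v in x)


def softplus_grad(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


x = [0.2, 0.5, 0.3]
y = [0.4, 0.4, 0.2]
sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))

d_l2 = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
d_sp = bregman(softplus, softplus_grad, x, y)

print("L2 Bregman divergence:", d_l2, "squared distance:", sq_dist)
print("KL Bregman divergence:", d_kl)
print("softplus divergence:", d_sp, "Lemma 2.1 bound:", 0.5 * 0.25 * sq_dist)
```

Since `x` and `y` both sum to one, the KL value coincides with $\sum_i x_i \log(x_i / y_i)$, and the softplus value stays below $\frac{L}{2}\|x - y\|_2^2$ as the lemma predicts.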

Proof.
\[
\begin{aligned}
D_\phi(x, y) &= \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla\phi(tx + (1 - t)y), x - y \rangle \, dt - \langle \nabla\phi(y), x - y \rangle \\
&= \int_0^1 \langle \nabla\phi(tx + (1 - t)y) - \nabla\phi(y), x - y \rangle \, dt \\
&\le \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 \\
&= \frac{L}{2} \|x - y\|_2^2,
\end{aligned}
\]
where the second equality comes from the mean value theorem (in integral form) and the inequality comes from the fact that $\phi$ is a twice-differentiable function with an $L$-Lipschitz continuous gradient.

3 Relaxed Wasserstein Divergence

We now propose a new class of statistical divergences called Relaxed Wasserstein (RW) divergence, which can be seen as a combination of Bregman divergence and Wasserstein divergence. The term "relaxed" refers to the fact that RW divergence relaxes the symmetry of the cost function $c(x, y)$ in (1) and extends to a broader class of asymmetric divergences.

Definition 3.1. The Relaxed Wasserstein divergence between the probability distributions $P$ and $Q$ is defined as
\[ W_{D_\phi}(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} D_\phi(x, y) \, \pi(dx, dy), \]
where $D_\phi$ is the Bregman divergence with a strictly convex and differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$.

Remark 3.
1. $W_{D_\phi}(P, Q) \ge 0$, and equality holds if and only if $P = Q$ almost everywhere.
2. $W_{D_\phi}(P, Q)$ is not a metric since $D_\phi(x, y)$ is asymmetric.
3. $W_{D_\phi}(P, Q)$ includes two important special cases, the squared $W_2^{L^2}$ and $W_{KL}$. More specifically, $W_{D_\phi} = [W_2^{L^2}]^2$ when $\phi(x) = \|x\|_2^2$, and $W_{D_\phi} = W_{KL}$ when $\phi(x) = x^\top \log(x)$.

3.1 Probabilistic Properties

In this subsection, we establish several probabilistic properties of RW divergence. Recall that the Wasserstein divergence is controlled by the weighted Total Variation (TV) distance (see Theorem 6.15 of [35] for more details). In parallel, we show that the RW divergence is dominated by the weighted TV distance and the squared Wasserstein-$L^2$ divergence.

Definition 3.2. The Total Variation distance between the probability distributions $P$ and $Q$ is defined as
\[ TV(P, Q) := \sup_{A} |P(A) - Q(A)|, \tag{4} \]
where $A$ is a Borel set.

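Theorem 3.1. Assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient. Then
\[ W_{D_\phi}(P, Q) \le L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q), \]
\[ W_{D_\phi}(P, Q) \le \frac{L}{2} \left[ W_2^{L^2}(P, Q) \right]^2, \]
where $P$ and $Q$ are two probability distributions supported on a compact set $\mathcal{X} \subseteq \mathbb{R}^d$.

Proof. For the first inequality, we define $\pi^*$ as the transfer plan that keeps all the mass shared by $P$ and $Q$ fixed and distributes the rest uniformly, i.e.,
\[ \pi^*(dx, dy) = (P \wedge Q)(dx) \, \delta_{\{y = x\}} + \frac{1}{a} (P - Q)_+(dx) \, (P - Q)_-(dy), \]
where $P \wedge Q = P - (P - Q)_+$ and $a = (P - Q)_+[\mathcal{X}] = (P - Q)_-[\mathcal{X}]$. Then, for an arbitrary fixed $x_0 \in \mathcal{X}$,
\[
\begin{aligned}
W_{D_\phi}(P, Q) &\le \int_{\mathcal{X} \times \mathcal{X}} D_\phi(x, y) \, \pi^*(dx, dy) \\
&= \frac{1}{a} \int\!\!\int \left[ \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&= \frac{1}{a} \int\!\!\int \left[ \int_0^1 \langle \nabla\phi(tx + (1 - t)y) - \nabla\phi(y), x - y \rangle \, dt \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le \frac{1}{a} \int\!\!\int \left[ \left( \int_0^1 t \, dt \right) L \|x - y\|_2^2 \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le \frac{L}{2a} \int\!\!\int \|x - y\|_2^2 \, (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le \frac{L}{a} \int\!\!\int \left[ \|x - x_0\|_2^2 + \|x_0 - y\|_2^2 \right] (P - Q)_+(dx) \, (P - Q)_-(dy) \\
&\le L \left[ \int \|x - x_0\|_2^2 \, (P - Q)_+(dx) + \int \|y - x_0\|_2^2 \, (P - Q)_-(dy) \right] \\
&= L \int \|x - x_0\|_2^2 \, |P - Q|(dx) \\
&\le L \, [\mathrm{diam}(\mathcal{X})]^2 \, |P - Q|(\mathcal{X}) \\
&\le L \, [\mathrm{diam}(\mathcal{X})]^2 \, TV(P, Q),
\end{aligned}
\]
where the first inequality comes from Definition 3.1, the first equality comes from Definition 2.2 and the definition of the specific $\pi^*$ (the mass on the diagonal contributes nothing because $D_\phi(x, x) = 0$), the second inequality is by Lemma 2.1, the fourth inequality comes from the triangle inequality, and the last inequality is from Definition 3.2.

For the second inequality, we have
\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} D_\phi(x, y) \, \pi(dx, dy) \\
&\le 0.5 L \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_2^2 \, \pi(dx, dy) \\
&= 0.5 L \left[ W_2^{L^2}(P, Q) \right]^2,
\end{aligned}
\]
where the inequality holds thanks to Lemma 2.1 and the fact that $\pi(dx, dy) \ge 0$ for any coupling $\pi \in \Pi(P, Q)$. This completes the proof.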
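The two bounds of Theorem 3.1 can be illustrated on a toy problem. The Python sketch below is a hedged illustration, not the paper's code: it assumes two uniform empirical measures with the same number of atoms on the real line, in which case an optimal coupling can be taken to be a permutation (a standard fact about the assignment problem), so a brute-force search over permutations suffices for tiny inputs. The function names, the sample points, and the choice of softplus as $\phi$ (with $L = 1/4$) are all assumptions made for the example.

```python
import itertools
import math


def bregman_1d(phi, dphi, x, y):
    """Scalar Bregman divergence D_phi(x, y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)


def ot_cost_uniform(xs, ys, cost):
    """Optimal transport cost between uniform measures on xs and ys.

    Brute force over all permutations; only meant for a handful of atoms.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("both samples must have the same size")
    return min(
        sum(cost(xs[i], ys[perm[i]]) for i in range(n)) / n
        for perm in itertools.permutations(range(n))
    )


def tv_distance(xs, ys):
    """Total variation between the two empirical measures (half the L1 distance)."""
    support = set(xs) | set(ys)
    return 0.5 * sum(
        abs(xs.count(a) / len(xs) - ys.count(a) / len(ys)) for a in support
    )


def phi(t):
    return math.log1p(math.exp(t))  # softplus: 0 < phi'' <= 1/4


def dphi(t):
    return 1.0 / (1.0 + math.exp(-t))


L = 0.25

xs = [-1.0, 0.0, 0.5, 2.0]
ys = [-0.5, 0.3, 1.0, 1.5]

rw = ot_cost_uniform(xs, ys, lambda x, y: bregman_1d(phi, dphi, x, y))
w2_squared = ot_cost_uniform(xs, ys, lambda x, y: (x - y) ** 2)
diam = max(xs + ys) - min(xs + ys)
tv = tv_distance(xs, ys)

bound_tv = L * diam ** 2 * tv      # first bound of Theorem 3.1
bound_w2 = 0.5 * L * w2_squared    # second bound of Theorem 3.1

print("RW divergence:", rw)
print("TV bound:", bound_tv, "squared-W2 bound:", bound_w2)
```

In this example the two supports are disjoint, so the TV bound is loose, while the squared Wasserstein-$L^2$ bound is of the same order as the RW divergence itself.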

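Next, we establish another key probabilistic property of RW divergence, i.e., the nonasymptotic moment estimates and the concentration inequality. Our results follow from two theorems presented in [14] and Theorem 3.1. We assume that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function with an $L$-Lipschitz continuous gradient, and we further define two statistics,
\[ M_q(P_r) = \int_{\mathcal{X}} \|x\|_2^q \, P_r(dx), \qquad E_{\alpha, \gamma}(P_r) = \int_{\mathcal{X}} \exp\left( \gamma \|x\|_2^\alpha \right) P_r(dx). \]

Theorem 3.2 (Nonasymptotic Moment Estimate). Assume that $M_q(P_r) < +\infty$ for some $q > 2$. Then there exists a constant $C(q, d) > 0$ such that, for $n \ge 1$,
\[
\mathbb{E}\left[ W_{D_\phi}(P_n, P_r) \right] \le \frac{C(q, d) \, L \, M_q^{2/q}(P_r)}{2} \cdot
\begin{cases}
n^{-1/2} + n^{-(q-2)/q}, & 1 \le d \le 3, \ q \ne 4, \\
n^{-1/2} \log(1 + n) + n^{-(q-2)/q}, & d = 4, \ q \ne 4, \\
n^{-2/d} + n^{-(q-2)/q}, & d \ge 5, \ q \ne d/(d-2).
\end{cases}
\]

Theorem 3.3 (Concentration Inequality). Assume that one of the following three conditions holds:
\[ \alpha > 2, \ \gamma > 0, \ E_{\alpha, \gamma}(P_r) < \infty, \tag{5} \]
or
\[ \alpha \in (0, 2), \ \gamma > 0, \ E_{\alpha, \gamma}(P_r) < \infty, \tag{6} \]
or
\[ q > 4, \ M_q(P_r) < \infty. \tag{7} \]
Then for $n \ge 1$ and $\epsilon > 0$,
\[ \mathrm{Prob}\left( W_{D_\phi}(P_n, P_r) \ge \epsilon \right) \le a(n, \epsilon) \, 1_{\{\epsilon \le L/2\}} + b(n, \epsilon), \]
where
\[
a(n, \epsilon) = C_1
\begin{cases}
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2} \right), & 1 \le d \le 3, \\
\exp\left( -\dfrac{4 c n \epsilon^2}{L^2 \log^2\left( 2 + \frac{L}{2\epsilon} \right)} \right), & d = 4, \\
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{d/2} \right), & d \ge 5,
\end{cases}
\]
and
\[
b(n, \epsilon) = C_2
\begin{cases}
\exp\left( -c n \left( \dfrac{2\epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, & \text{under condition (5)}, \\
\exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{(\alpha - \epsilon')/2} \right) 1_{\{\epsilon \le L/2\}} + \exp\left( -c \left( \dfrac{2 n \epsilon}{L} \right)^{\alpha/2} \right) 1_{\{\epsilon > L/2\}}, \ 0 < \epsilon' < \alpha, & \text{under condition (6)}, \\
n \left( \dfrac{2 n \epsilon}{L} \right)^{-(q - \epsilon')/2}, \ 0 < \epsilon' < q, & \text{under condition (7)},
\end{cases}
\]
where $c$, $C_1$ and $C_2$ are constants depending on $q$ and $d$.

3.2 Continuity, Differentiability and Duality Representation

In this subsection, we establish the continuity, differentiability and duality representation of RW divergence, demonstrating that RW divergence is a reasonable choice for GANs. We first present a simple yet important lemma.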

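Lemma 3.4 (Decomposition of RW divergence). The RW divergence can be decomposed in terms of the distorted squared Wasserstein-$L^2$ divergence of order 2 with several additional residual terms independent of the choice of the coupling $\pi$, i.e.,
\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx). \]
The relationship is also presented in Figure 1.

[Figure 1: The decomposition of $W_{D_\phi}$, where the solid arrow denotes the transformation and the dashed arrows denote the divergences between probability distributions. The diagram maps $Q$ to $Q \circ (\nabla\phi)^{-1}$ through $\nabla\phi$, and links $P$ to $Q$ by $W_{D_\phi}(P, Q)$ and $P$ to $Q \circ (\nabla\phi)^{-1}$ by $W_2^{L^2}$.]

Proof. First, we need to prove that the inverse of $\nabla\phi$ is well-defined. Recall that $\phi : \mathcal{X} \to \mathbb{R}$ is a strictly convex and twice-differentiable function, so $\nabla^2\phi(x) \succ 0$ for $x \in \mathcal{X}$. That is to say, the gradient mapping $\nabla\phi : \mathcal{X} \to \mathbb{R}^d$ has a positive-definite Jacobian matrix at each point. Applying the mean value theorem yields that $\nabla\phi$ is injective, so the inverse of $\nabla\phi$ exists and is bijective. Denote it as $(\nabla\phi)^{-1} : \nabla\phi(\mathcal{X}) \to \mathcal{X}$; then $Q \circ (\nabla\phi)^{-1}$ is also a probability distribution. Thus
\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \inf_{\pi \in \Pi(P, Q)} \int \left[ \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle \right] \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P, Q)} \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla\phi(y)\|_2^2 - \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy) \\
&\quad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] \pi(dx, dy) + \int \left[ \langle \nabla\phi(y), y \rangle - \phi(y) - \frac{1}{2} \|\nabla\phi(y)\|_2^2 \right] \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P, Q)} \int \left[ \frac{1}{2} \|x\|_2^2 + \frac{1}{2} \|\nabla\phi(y)\|_2^2 - \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy) \\
&\quad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx).
\end{aligned}
\]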

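Furthermore, observe that
\[
\begin{aligned}
\left[ W_2^{L^2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 &= \inf_{\pi \in \Pi(P, Q \circ (\nabla\phi)^{-1})} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P, Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - \nabla\phi(y)\|_2^2 \, \pi(dx, dy) \\
&= \inf_{\pi \in \Pi(P, Q)} \int \left[ \|x\|_2^2 + \|\nabla\phi(y)\|_2^2 - 2 \langle \nabla\phi(y), x \rangle \right] \pi(dx, dy).
\end{aligned}
\]
Therefore,
\[ W_{D_\phi}(P, Q) = \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx). \]
This completes the proof.

Now we are ready to present our main results on the continuity and differentiability of the parametrized RW divergence in generative modeling.

Definition 3.3 (Generative modeling). The procedure of generative modeling is to approximate the unknown probability distribution $P_r$ by constructing a class of suitable parametric probability distributions $P_\theta$. More specifically, we define a latent variable $Z \in \mathcal{Z}$ with a fixed probability distribution $P_Z$ and a family of parametric functions $g_\theta : \mathcal{Z} \to \mathcal{X}$. Then $P_\theta$ is defined as the probability distribution of $g_\theta(Z)$.

Theorem 3.5 (Continuity and Differentiability of RW divergence). We have the following two statements about RW divergence:
1. $W_{D_\phi}(P_r, P_\theta)$ is continuous in $\theta$ if $g_\theta$ is continuous in $\theta$.
2. $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$, i.e., for each given $(\theta_0, z_0)$, there exists a neighborhood $N$ such that
\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \le L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]
for any $(\theta, z) \in N$.

Proof. It follows from Lemma 3.4 that
\[ W_{D_\phi}(P_r, P_\theta) = T_1 + T_2, \]
where
\[ T_1 = \frac{1}{2} \left[ W_2^{L^2}\left( P_r, P_\theta \circ (\nabla\phi)^{-1} \right) \right]^2, \]
\[ T_2 = \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P_r(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] P_\theta(dx). \]
We observe that $T_2$ is continuous and differentiable with respect to $\theta$ since $\phi$ is a twice-differentiable function. Furthermore, since $(\nabla\phi)^{-1}$ is also continuous and differentiable, it suffices to show that $W_2^{L^2}(P_r, P_\theta)$ is continuous in $\theta$ if $g_\theta$ is continuous in $\theta$, and differentiable almost everywhere if $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$ for any $\theta$.

Given two vectors $\theta_0, \theta \in \mathbb{R}^d$, we define $\pi$ as the joint distribution of $(g_\theta(Z), g_{\theta_0}(Z))$ where $Z \sim P_Z$. Then
\[ W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_2^2 \, \pi(dx, dy) \right)^{1/2} = \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2}, \]
where
\[ \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \to 0, \quad \forall z \in \mathcal{Z}, \]
as $\theta \to \theta_0$, since $g_\theta$ is continuous in $\theta$. Furthermore, $\|g_{\theta_1}(z) - g_{\theta_2}(z)\|_2^2$ is uniformly bounded on $\mathcal{Z}$ since $g_\theta(z) \in \mathcal{X}$ and $\mathcal{X}$ is a compact set. Therefore, applying the bounded convergence theorem yields
\[ \left| W_2^{L^2}(P_r, P_\theta) - W_2^{L^2}(P_r, P_{\theta_0}) \right| \le W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{Z}} \|g_\theta(z) - g_{\theta_0}(z)\|_2^2 \, P_Z(dz) \right)^{1/2} \to 0 \quad \text{as } \theta \to \theta_0, \]
where the first inequality comes from the triangle inequality.

Given a pair $(\theta_0, z_0)$, the local Lipschitz continuity of $g_\theta$ implies that there exists a neighborhood $N$ such that
\[ \|g_\theta(z) - g_{\theta_0}(z_0)\|_2 \le L(\theta_0, z_0) \left( \|\theta - \theta_0\|_2 + \|z - z_0\|_2 \right) \]
for any $(\theta, z) \in N$. Then
\[ \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \le \int_{\mathcal{Z}} [L(\theta_0, z_0)]^2 \, \|\theta - \theta_0\|_2^2 \, P_Z(dz_0) = \|\theta - \theta_0\|_2^2 \, \mathbb{E}\left[ L(\theta_0, Z)^2 \right]. \]
Therefore,
\[ \left| W_2^{L^2}(P_r, P_\theta) - W_2^{L^2}(P_r, P_{\theta_0}) \right| \le W_2^{L^2}(P_\theta, P_{\theta_0}) \le \left( \int_{\mathcal{Z}} \|g_\theta(z_0) - g_{\theta_0}(z_0)\|_2^2 \, P_Z(dz_0) \right)^{1/2} \le \|\theta - \theta_0\|_2 \, \mathbb{E}\left[ L(\theta_0, Z)^2 \right]^{1/2}, \]
which implies that $W_2^{L^2}(P_r, P_\theta)$ is locally Lipschitz. Applying Rademacher's theorem [13] yields that $W_2^{L^2}(P_r, P_\theta)$ is differentiable with respect to $\theta$ almost everywhere. This completes the proof.

Next, we present our results on the duality representation of RW divergence.

Theorem 3.6 (Duality Representation of RW divergence). Assume that two probability distributions $P$ and $Q$ satisfy
\[ \int \|x\|_2^2 \, (P + Q)(dx) < +\infty. \]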

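Then there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as
\[ W_{D_\phi}(P, Q) = \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla\phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right), \]
where $f^*$ is the conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$.

Proof. Given two probability distributions $P$ and $Q$ that satisfy
\[ \int \|x\|_2^2 \, (P + Q)(dx) < +\infty, \]
it follows from Proposition 3.1 of [6] that there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the squared Wasserstein-$L^2$ divergence of order 2 has a duality representation, i.e.,
\[
\begin{aligned}
\left[ W_2^{L^2}(P, Q) \right]^2 &= \inf_{\pi \in \Pi(P, Q)} \int \|x - y\|_2^2 \, \pi(dx, dy) \\
&= \int \|x\|_2^2 \, (P + Q)(dx) - 2 \left( \int f(x) \, P(dx) + \int f^*(x) \, Q(dx) \right),
\end{aligned}
\]
where $f^*$ is the convex conjugate of $f$, i.e., $f^*(y) = \sup_{x \in \mathbb{R}^d} \langle x, y \rangle - f(x)$. Combining this with Lemma 3.4 yields
\[
\begin{aligned}
W_{D_\phi}(P, Q) &= \frac{1}{2} \left[ W_2^{L^2}\left( P, Q \circ (\nabla\phi)^{-1} \right) \right]^2 + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx) \\
&= \frac{1}{2} \left( \int \|x\|_2^2 \, P(dx) + \int \|\nabla\phi(x)\|_2^2 \, Q(dx) \right) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right) \\
&\quad + \int \left[ \phi(x) - \frac{1}{2} \|x\|_2^2 \right] P(dx) + \int \left[ \langle \nabla\phi(x), x \rangle - \phi(x) - \frac{1}{2} \|\nabla\phi(x)\|_2^2 \right] Q(dx) \\
&= \int \phi(x) \, (P - Q)(dx) + \int \langle \nabla\phi(x), x \rangle \, Q(dx) - \left( \int f(x) \, P(dx) + \int f^*(\nabla\phi(x)) \, Q(dx) \right).
\end{aligned}
\]
This completes the proof.

Finally, we show that Theorem 3.6 allows for an explicit formula for the gradient evaluation in generative modeling (Definition 3.3), providing the theoretical guarantee for RWGANs training.

Corollary 3.6.1 (Gradient Evaluation). Under the setting of generative modeling, we assume that $g_\theta$ is locally Lipschitz with a constant $L(\theta, z)$ such that $\mathbb{E}\left[ L(\theta, Z)^2 \right] < \infty$ and
\[ \int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty. \]
Then there exists a Lipschitz continuous solution $f : \mathcal{X} \to \mathbb{R}$ such that the gradient of the RW divergence has an explicit form, i.e.,
\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z\left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z\left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right]. \]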

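Proof. Under the conditions that $g_\theta$ is locally Lipschitz and
\[ \int \|x\|_2^2 \, (P_r + P_\theta)(dx) < +\infty, \]
it follows from Theorem 3.5 and Theorem 3.6 that $W_{D_\phi}(P_r, P_\theta)$ is differentiable almost everywhere and there exists a Lipschitz continuous function $f : \mathcal{X} \to \mathbb{R}$ such that the RW divergence has a duality representation as
\[ W_{D_\phi}(P_r, P_\theta) = \int \phi(x) \, (P_r - P_\theta)(dx) + \int \langle \nabla\phi(x), x \rangle \, P_\theta(dx) - \left( \int f(x) \, P_r(dx) + \int f^*(\nabla\phi(x)) \, P_\theta(dx) \right). \]
By the envelope theorem [26], we obtain that
\[
\begin{aligned}
\nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] &= \nabla_\theta \left[ -\int \phi(x) \, P_\theta(dx) + \int \langle \nabla\phi(x), x \rangle \, P_\theta(dx) - \int f^*(\nabla\phi(x)) \, P_\theta(dx) \right] \\
&= \nabla_\theta \left[ -\int_{\mathcal{Z}} \phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} \langle \nabla\phi(g_\theta(z)), g_\theta(z) \rangle \, P_Z(dz) - \int_{\mathcal{Z}} f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz) \right] \\
&= -\int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla\phi(g_\theta(z)) \, P_Z(dz) + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla\phi(g_\theta(z)) \, P_Z(dz) \\
&\quad + \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2\phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz) \\
&= \int_{\mathcal{Z}} [\nabla_\theta g_\theta(z)]^\top \nabla^2\phi(g_\theta(z)) \, g_\theta(z) \, P_Z(dz) - \int_{\mathcal{Z}} \nabla_\theta f^*(\nabla\phi(g_\theta(z))) \, P_Z(dz).
\end{aligned}
\]
Relabeling $-f^*$ as $f$, we conclude that
\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] = \mathbb{E}_Z\left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] + \mathbb{E}_Z\left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right], \]
where $f$ is Lipschitz continuous. This completes the proof.

4 Empirical Results

In this section, we present a numerical evaluation on image generation to demonstrate the effectiveness and efficiency of using RW in GANs. We first describe our approach of RW in GANs (RWGANs) and the experimental setting (Section 4.1), and then report the experimental results for RWGANs and nine other well-established variants of GANs (Section 4.2).

4.1 Experimental Approach and Setting

RWGANs approach. The goal of GANs is to estimate a probability distribution $P_r$ hidden in the data. As in Definition 3.3, one can define a random variable $Z$ with a fixed distribution $P_Z$ and pass it through a parametric function $g_\theta : \mathcal{Z} \to \mathcal{X}$ to construct a probability distribution $P_\theta$. In this light, one can learn the probability distribution $P_r$ by adapting $\theta$ and fitting the data with $P_\theta$. This approximation is done by finding a solution $f$ that optimizes a given cost function between $P_r$ and $P_\theta$.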
Despite the explicit formulas derived in the duality representation and the gradient evaluation (Theorem 3.6 and Corollary 3.6.1), it is infeasible to directly compute such an $f$ in practice. Nevertheless, since the Wasserstein divergence is parametrized by any convex function in RWGANs, it provides a great deal of flexibility in the choice of loss functions. For example, one can choose an appropriate $\phi$ such that
\[ \nabla_\theta \left[ W_{D_\phi}(P_r, P_\theta) \right] \approx \mathbb{E}_Z\left[ \nabla_\theta f(\nabla\phi(g_\theta(Z))) \right]. \]
In our experiment, we try the KL divergence, where $\nabla^2\phi(x) = \mathrm{diag}(1/x)$, since we observe that
\[ \mathbb{E}_Z\left[ [\nabla_\theta g_\theta(Z)]^\top \nabla^2\phi(g_\theta(Z)) \, g_\theta(Z) \right] = \mathbb{E}_Z\left[ [\nabla_\theta g_\theta(Z)]^\top \mathbf{1} \right] \le C, \]
where $C$ is a constant depending on the Lipschitz constant of $g_\theta$. This implies that this term is controlled by $\theta$ during the process of training. The numerical results confirm the effectiveness of our heuristic in practice.

Experimental framework. Our experimental framework is similar to the one in WGANs [2] in that a) we apply back-propagation to train the generator and discriminator networks, and b) we update the parameters once in the generative model and $n_{\text{critic}}$ times in the discriminator network. Our framework differs from WGANs [2] in several aspects. First, we use $\nabla\phi$ to do asymmetric clipping instead of symmetric clipping. Note that the asymmetric clipping guarantees the Lipschitz continuity of $f$ and $\nabla\phi(w) \in [-c, c]$. Second, we use a scaling parameter $S$ to stabilize the asymmetric clipping. This is critical for the experiment. Finally, we adopt RMSProp [34] instead of ADAM [19], which allows the choice of a larger step-size and avoids the non-stationarity problem [28]. We describe our method with default parameters in Algorithm 1.

Algorithm 1 RWGANs. The default values are $\alpha$ = [value missing in the source], $c = 0.005$, $S = 0.01$, $m = 64$, $n_{\text{critic}} = 5$.
Require: $\alpha$: the learning rate; $c$: the clipping parameter; $m$: the batch size; $n_{\text{critic}}$: the number of iterations of the critic per generator iteration; $N_{\max}$: the maximum number of epochs (one forward pass and one backward pass over all the training examples).
Require: $w_0$: initial critic parameters; $\theta_0$: initial generator parameters.
for $N = 1, 2, \ldots, N_{\max}$ do
    for $t = 0, \ldots, n_{\text{critic}}$ do
        Sample a batch of real data $\{x_i\}_{i=1}^m$ from $P_r$.
        Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
        $g_w \leftarrow \frac{1}{m} \sum_{i=1}^m \left[ \nabla_w f_w(x_i) - \nabla_w f_w(g_\theta(z_i)) \right]$.
        $w \leftarrow w + \alpha \cdot \mathrm{RMSProp}(w, g_w)$.
        $w \leftarrow \mathrm{clip}\left( w, \ S \cdot (\nabla\phi)^{-1}(-c), \ S \cdot (\nabla\phi)^{-1}(c) \right)$.
    end for
    Sample a batch of prior samples $\{z_i\}_{i=1}^m$ from $p(z)$.
    $g_\theta \leftarrow \frac{1}{m} \sum_{i=1}^m \nabla_\theta f_w(\nabla\phi(g_\theta(z_i)))$.
    $\theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta, g_\theta)$.
end for
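The clipping step of Algorithm 1 can be made concrete for the KL choice $\phi(x) = x \log x$, for which $\nabla\phi(x) = \log x + 1$ and hence $(\nabla\phi)^{-1}(y) = \exp(y - 1)$. The Python sketch below is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the critic weights are represented as a plain list of floats, and the bounds are taken exactly as they appear in Algorithm 1 above (the extracted text does not show whether the lower bound carried an additional minus sign, so this reading is an assumption).

```python
import math


def kl_inverse_gradient(y):
    """Inverse of grad phi for phi(x) = x log x, i.e. the solution x of log x + 1 = y."""
    return math.exp(y - 1.0)


def asymmetric_clip(weights, c=0.005, scale=0.01, inv_grad=kl_inverse_gradient):
    """Clip every weight to [scale * inv_grad(-c), scale * inv_grad(c)].

    c and scale default to the values of c and S listed for Algorithm 1.
    Returns the clipped weights together with the clipping interval.
    """
    lower = scale * inv_grad(-c)
    upper = scale * inv_grad(c)
    clipped = [min(max(w, lower), upper) for w in weights]
    return clipped, (lower, upper)


# Hypothetical critic weights, chosen only to exercise both bounds.
weights = [-0.5, 0.0, 0.00368, 0.2]
clipped, (lower, upper) = asymmetric_clip(weights)

print("clipping interval:", lower, upper)
print("clipped weights:", clipped)
```

Under this reading the interval is not symmetric about zero, and after dividing by the scale $S$ every clipped weight $w$ satisfies $\nabla\phi(w / S) \in [-c, c]$, which matches the remark preceding the algorithm. In a deep-learning framework the same operation would typically be a tensor clamp applied to each parameter after the optimizer step.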

Experimental setting. In order to test RWGANs, we adopt nine baseline methods as discussed in the introduction. The methods compared are RWGANs, WGANs [2], WGANs-GP [17], CGANs [27], InfoGANs [10], GANs [16], LSGANs [24], DRAGANs [20], BEGANs [5], EBGANs [38], and ACGANs [29]. The implementation of all these approaches is based on publicly available online information. In addition, we use the following four standard and well-known datasets in our experiments.

1. MNIST is a dataset of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
2. Fashion-MNIST is a dataset of Zalando's article images, designed as an alternative to MNIST. It consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a gray-scale image associated with a label from 10 classes.
3. The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 6,000 images per class. It is divided into a training set and a test set.
4. The ImageNet dataset is a large visual database designed for visual object recognition research. As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate which objects are in the picture. In at least one million of the images, bounding boxes are also provided.

Metric. The negative critic loss, a standard quantitative metric, is used in all our experiments. In addition to the negative critic loss, we use the inception score [33] to evaluate samples generated by the three WGANs methods on CIFAR-10 and ImageNet. The inception score is defined as
\[ \text{Inception\_Score} = \exp\left\{ \mathbb{E}_x \left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right\}, \]
where $p(y \mid x)$ is given by the inception network. A high inception score indicates that the images generated by the model are highly interpretable and diversified. It is also highly correlated with human evaluation of the images.
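To make the inception score formula concrete, the following Python sketch evaluates it from a small table of class probabilities. It is a minimal illustration under assumptions: `cond_probs[i][k]` stands for $p(y = k \mid x_i)$ as it would be produced by a classifier, the marginal $p(y)$ is estimated by averaging those rows, and the function name and toy inputs are hypothetical. Running the actual inception network is outside the scope of the sketch.

```python
import math


def inception_score(cond_probs):
    """exp of the average KL divergence between p(y | x_i) and the marginal p(y).

    cond_probs is a list of rows; row i holds p(y = k | x_i) and sums to one.
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Estimate the marginal p(y) by averaging the conditional distributions.
    marginal = [sum(row[k] for row in cond_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in cond_probs:
        # Terms with p = 0 contribute zero to the KL divergence.
        total_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0.0)
    return math.exp(total_kl / n)


# Uninformative predictions: every image looks the same to the classifier.
flat = [[0.5, 0.5], [0.5, 0.5]]
# Confident and diverse predictions: each image maps to a distinct class.
sharp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print("flat score:", inception_score(flat))
print("sharp score:", inception_score(sharp))
```

The score ranges from 1 (identical, uninformative predictions) up to the number of classes (confident predictions spread evenly over all classes), which is why a higher value is read as a sign of both interpretable and diverse samples.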
4.2 Experimental Results

Experiments on MNIST and Fashion-MNIST. We start by training models with the ten different GAN procedures on MNIST and Fashion-MNIST. The architecture is DCGAN [31] and the maximum number of epochs is 100. Figure 3 shows the training curves of the negative critic loss for all candidate approaches. The figure indicates that RWGANs and WGANs are stable with the smallest variances, where RWGANs have a slightly higher variance, partly due to the use of a larger step-size and asymmetric clipping. This slightly higher variance nevertheless speeds up training. Indeed, as illustrated in Figure 4 and Figure 5, RWGANs are the fastest to generate meaningful images. Note that CGANs and InfoGANs seem faster, but the images they generate are not meaningful, as they fall into bad local optima from an optimization perspective.

Experiments on CIFAR-10 and ImageNet. After observing that WGANs and RWGANs perform the best among all the variants of GANs, we proceed to compare RWGANs and WGANs, together with WGANs with Gradient Penalty (WGANs-GP), on two much larger datasets, CIFAR-10 and ImageNet. Here the architectures used are DCGAN and ReLU-MLP [11], and the maximum number of epochs is set to 25. Figure 6 shows the training curves of the negative critic loss for these three approaches. WGANs achieve the smallest variance. By contrast, the negative critic loss of WGANs-GP tends to diverge as training progresses, suggesting that this method might not be robust in practice despite its fast rate of training. RWGANs achieve relatively low variance with a convergent negative critic loss, striking a trade-off between robustness and efficiency. We then evaluate the candidate methods with the inception score and present the results in Table 2. The table shows that RWGANs are often the fastest method: they perform the best in three out of four cases during the early epochs and obtain images of competitively high quality at the final stage. Figures 7, 8, 9 and 10 show the sample quality of the images generated at the initial and final stages, which supports our conclusion.

[Table 2 layout. Rows: RWGANs, WGANs and WGANs-GP for each architecture (DCGAN, MLP). Columns: CIFAR-10 (first 5 epochs, last 10 epochs) and ImageNet (first 3 epochs, last 5 epochs).]

Table 2: Inception scores at the beginning and final stages of training.
DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units at each layer.

5 Conclusion

We propose a novel class of statistical divergences called RW divergence and establish several important theoretical properties that are crucial for GAN training. The experiments on image generation, with RW parametrized by the KL divergence, show that RWGANs strike a promising trade-off between WGANs and WGANs-GP, achieving both robustness and efficiency during the learning process. The asymmetric clipping in RWGANs is a viable alternative to the gradient penalty and the symmetric clipping in WGANs, avoiding low-quality samples and failure to converge. The flexible framework of RW divergences opens the door to the implementation and comparison of different loss functions for GANs. It also raises the natural question of whether one can select φ according to the statistics of the data and the structure of the problem.

References

[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML.
[3] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7).
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct).
[5] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint.
[6] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4).
[7] L. Caffarelli. Some regularity properties of solutions of Monge-Ampère equation. Communications on Pure and Applied Mathematics, 44(8-9).
[8] L. Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99-104.
[9] S. Chen and A. Figalli. Partial W^{2,p} regularity for optimal transport maps. Journal of Functional Analysis, 272(11).
[10] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS.
[11] B. Conan-Guez and F. Rossi. Multi-Layer Perceptrons for functional data analysis: a projection based approach. Artificial Neural Networks, ICANN 2002.
[12] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS.
[13] L. Evans and R. Gariepy. Measure Theory and Fine Properties of Functions. CRC Press.

[14] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4).
[15] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. AAAI.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS.
[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint.
[18] L. K. Jones and C. L. Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Transactions on Information Theory.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint.
[20] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint.
[21] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR.
[22] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint.
[23] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. NIPS Workshop on Adversarial Training.
[24] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint.
[25] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. ICML.
[26] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2).
[27] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint.
[28] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML.
[29] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. ICML.

[30] M. C. Pardo and I. Vajda. On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory, 49(7).
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint.
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. ICML.
[33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. NIPS.
[34] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
[35] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
[36] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. NIPS.
[37] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint.
[38] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint.
[39] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. ECCV.

[Figure 3 panels: one training-curve plot per method (RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs, ACGANs) for each of MNIST and Fashion-MNIST.]

Figure 3: Training curves of the negative critic loss at different stages of training on MNIST and Fashion-MNIST. G loss and D loss refer to the losses of the generative and discriminative nets, plotted as orange and blue lines, respectively.

[Figure 4 panels. Rows: RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs, ACGANs. Columns: N = 1, N = 10, N = 25, N = 100.]

Figure 4: Sample qualities at different stages of training on MNIST.

[Figure 5 panels. Rows: RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs, ACGANs. Columns: N = 1, N = 10, N = 25, N = 100.]

Figure 5: Sample qualities at different stages of training on Fashion-MNIST.

[Figure 6 panels. Rows: RWGANs, WGANs and WGANs-GP, each with the DCGAN and MLP architectures. Columns: CIFAR-10, ImageNet.]

Figure 6: Training curves at different stages of training. DCGAN refers to the standard DCGAN generator and MLP refers to a ReLU-MLP with 4 hidden layers and 512 units at each layer. G loss and D loss refer to the losses of the generative and discriminative nets. The loss in RWGANs converges consistently, while the loss in WGANs-GP tends to diverge as training progresses. WGANs achieve the lowest variance among the three methods.

[Figure 7 panels. Rows: RWGANs, WGANs, WGANs-GP. Columns: DCGAN, MLP. N = 1.]

Figure 7: Sample qualities at the initial stage of training on CIFAR-10.

[Figure 8 panels. Rows: RWGANs, WGANs, WGANs-GP. Columns: DCGAN, MLP. N = 100.]

Figure 8: Sample qualities at the final stage of training on CIFAR-10.

[Figure 9 panels. Rows: RWGANs, WGANs, WGANs-GP. Columns: DCGAN, MLP. N = 1.]

Figure 9: Sample qualities at the initial stage of training on ImageNet.

[Figure 10 panels. Rows: RWGANs, WGANs, WGANs-GP. Columns: DCGAN, MLP. N = 25.]

Figure 10: Sample qualities at the final stage of training on ImageNet.

Lecture 14: Deep Generative Learning

Lecture 14: Deep Generative Learning Generative Modeling CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 14: Deep Generative Learning Density estimation Reconstructing probability density function using samples Bohyung Han

More information

Generative Adversarial Networks

Generative Adversarial Networks Generative Adversarial Networks SIBGRAPI 2017 Tutorial Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask Presentation content inspired by Ian Goodfellow s tutorial

More information

Nishant Gurnani. GAN Reading Group. April 14th, / 107

Nishant Gurnani. GAN Reading Group. April 14th, / 107 Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,

More information

Singing Voice Separation using Generative Adversarial Networks

Singing Voice Separation using Generative Adversarial Networks Singing Voice Separation using Generative Adversarial Networks Hyeong-seok Choi, Kyogu Lee Music and Audio Research Group Graduate School of Convergence Science and Technology Seoul National University

More information

arxiv: v1 [cs.lg] 20 Apr 2017

arxiv: v1 [cs.lg] 20 Apr 2017 Softmax GAN Min Lin Qihoo 360 Technology co. ltd Beijing, China, 0087 mavenlin@gmail.com arxiv:704.069v [cs.lg] 0 Apr 07 Abstract Softmax GAN is a novel variant of Generative Adversarial Network (GAN).

More information

Negative Momentum for Improved Game Dynamics

Negative Momentum for Improved Game Dynamics Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal

More information

GENERATIVE ADVERSARIAL LEARNING

GENERATIVE ADVERSARIAL LEARNING GENERATIVE ADVERSARIAL LEARNING OF MARKOV CHAINS Jiaming Song, Shengjia Zhao & Stefano Ermon Computer Science Department Stanford University {tsong,zhaosj12,ermon}@cs.stanford.edu ABSTRACT We investigate

More information

Wasserstein GAN. Juho Lee. Jan 23, 2017

Wasserstein GAN. Juho Lee. Jan 23, 2017 Wasserstein GAN Juho Lee Jan 23, 2017 Wasserstein GAN (WGAN) Arxiv submission Martin Arjovsky, Soumith Chintala, and Léon Bottou A new GAN model minimizing the Earth-Mover s distance (Wasserstein-1 distance)

More information

Training Generative Adversarial Networks Via Turing Test

Training Generative Adversarial Networks Via Turing Test raining enerative Adversarial Networks Via uring est Jianlin Su School of Mathematics Sun Yat-sen University uangdong, China bojone@spaces.ac.cn Abstract In this article, we introduce a new mode for training

More information

arxiv: v3 [stat.ml] 20 Feb 2018

arxiv: v3 [stat.ml] 20 Feb 2018 MANY PATHS TO EQUILIBRIUM: GANS DO NOT NEED TO DECREASE A DIVERGENCE AT EVERY STEP William Fedus 1, Mihaela Rosca 2, Balaji Lakshminarayanan 2, Andrew M. Dai 1, Shakir Mohamed 2 and Ian Goodfellow 1 1

More information

Which Training Methods for GANs do actually Converge?

Which Training Methods for GANs do actually Converge? Lars Mescheder 1 Andreas Geiger 1 2 Sebastian Nowozin 3 Abstract Recent work has shown local convergence of GAN training for absolutely continuous data and generator distributions. In this paper, we show

More information

A QUANTITATIVE MEASURE OF GENERATIVE ADVERSARIAL NETWORK DISTRIBUTIONS

A QUANTITATIVE MEASURE OF GENERATIVE ADVERSARIAL NETWORK DISTRIBUTIONS A QUANTITATIVE MEASURE OF GENERATIVE ADVERSARIAL NETWORK DISTRIBUTIONS Dan Hendrycks University of Chicago dan@ttic.edu Steven Basart University of Chicago xksteven@uchicago.edu ABSTRACT We introduce a

More information

Some theoretical properties of GANs. Gérard Biau Toulouse, September 2018

Some theoretical properties of GANs. Gérard Biau Toulouse, September 2018 Some theoretical properties of GANs Gérard Biau Toulouse, September 2018 Coauthors Benoît Cadre (ENS Rennes) Maxime Sangnier (Sorbonne University) Ugo Tanielian (Sorbonne University & Criteo) 1 video Source:

More information

SOLVING LINEAR INVERSE PROBLEMS USING GAN PRIORS: AN ALGORITHM WITH PROVABLE GUARANTEES. Viraj Shah and Chinmay Hegde

SOLVING LINEAR INVERSE PROBLEMS USING GAN PRIORS: AN ALGORITHM WITH PROVABLE GUARANTEES. Viraj Shah and Chinmay Hegde SOLVING LINEAR INVERSE PROBLEMS USING GAN PRIORS: AN ALGORITHM WITH PROVABLE GUARANTEES Viraj Shah and Chinmay Hegde ECpE Department, Iowa State University, Ames, IA, 5000 In this paper, we propose and

More information

Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization

Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization Sebastian Nowozin, Botond Cseke, Ryota Tomioka Machine Intelligence and Perception Group

More information

arxiv: v3 [cs.lg] 11 Jun 2018

arxiv: v3 [cs.lg] 11 Jun 2018 Lars Mescheder 1 Andreas Geiger 1 2 Sebastian Nowozin 3 arxiv:1801.04406v3 [cs.lg] 11 Jun 2018 Abstract Recent work has shown local convergence of GAN training for absolutely continuous data and generator

More information

Multiplicative Noise Channel in Generative Adversarial Networks

Multiplicative Noise Channel in Generative Adversarial Networks Multiplicative Noise Channel in Generative Adversarial Networks Xinhan Di Deepearthgo Deepearthgo@gmail.com Pengqian Yu National University of Singapore yupengqian@u.nus.edu Abstract Additive Gaussian

More information

MMD GAN 1 Fisher GAN 2

MMD GAN 1 Fisher GAN 2 MMD GAN 1 Fisher GAN 1 Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos (CMU, IBM Research) Youssef Mroueh, and Tom Sercu (IBM Research) Presented by Rui-Yi(Roy) Zhang Decemeber

More information

Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) September 26 & October 3, 2017 Section 1 Preliminaries Kullback-Leibler divergence KL divergence (continuous case) p(x) andq(x) are two density distributions. Then the KL-divergence is defined as Z KL(p

More information

arxiv: v1 [cs.lg] 8 Dec 2016

arxiv: v1 [cs.lg] 8 Dec 2016 Improved generator objectives for GANs Ben Poole Stanford University poole@cs.stanford.edu Alexander A. Alemi, Jascha Sohl-Dickstein, Anelia Angelova Google Brain {alemi, jaschasd, anelia}@google.com arxiv:1612.02780v1

More information

ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT. Ashutosh Pandey 1 and Deliang Wang 1,2. {pandey.99, wang.5664,

ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT. Ashutosh Pandey 1 and Deliang Wang 1,2. {pandey.99, wang.5664, ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT Ashutosh Pandey and Deliang Wang,2 Department of Computer Science and Engineering, The Ohio State University, USA 2 Center for Cognitive

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Deep Generative Models. (Unsupervised Learning)

Deep Generative Models. (Unsupervised Learning) Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What

More information

Energy-Based Generative Adversarial Network

Energy-Based Generative Adversarial Network Energy-Based Generative Adversarial Network Energy-Based Generative Adversarial Network J. Zhao, M. Mathieu and Y. LeCun Learning to Draw Samples: With Application to Amoritized MLE for Generalized Adversarial

More information

arxiv: v1 [eess.iv] 28 May 2018

arxiv: v1 [eess.iv] 28 May 2018 Versatile Auxiliary Regressor with Generative Adversarial network (VAR+GAN) arxiv:1805.10864v1 [eess.iv] 28 May 2018 Abstract Shabab Bazrafkan, Peter Corcoran National University of Ireland Galway Being

More information

Understanding GANs: Back to the basics

Understanding GANs: Back to the basics Understanding GANs: Back to the basics David Tse Stanford University Princeton University May 15, 2018 Joint work with Soheil Feizi, Farzan Farnia, Tony Ginart, Changho Suh and Fei Xia. GANs at NIPS 2017

More information

Nonparametric Inference for Auto-Encoding Variational Bayes

Nonparametric Inference for Auto-Encoding Variational Bayes Nonparametric Inference for Auto-Encoding Variational Bayes Erik Bodin * Iman Malik * Carl Henrik Ek * Neill D. F. Campbell * University of Bristol University of Bath Variational approximations are an

More information

Generative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab,

Generative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab, Generative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab, 2016-08-31 Generative Modeling Density estimation Sample generation

More information

Bregman Divergences for Data Mining Meta-Algorithms

Bregman Divergences for Data Mining Meta-Algorithms p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,

More information

arxiv: v4 [cs.cv] 5 Sep 2018

arxiv: v4 [cs.cv] 5 Sep 2018 Wasserstein Divergence for GANs Jiqing Wu 1, Zhiwu Huang 1, Janine Thoma 1, Dinesh Acharya 1, and Luc Van Gool 1,2 arxiv:1712.01026v4 [cs.cv] 5 Sep 2018 1 Computer Vision Lab, ETH Zurich, Switzerland {jwu,zhiwu.huang,jthoma,vangool}@vision.ee.ethz.ch,

More information

A Unified View of Deep Generative Models

A Unified View of Deep Generative Models SAILING LAB Laboratory for Statistical Artificial InteLigence & INtegreative Genomics A Unified View of Deep Generative Models Zhiting Hu and Eric Xing Petuum Inc. Carnegie Mellon University 1 Deep generative

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks Emily Denton 1, Soumith Chintala 2, Arthur Szlam 2, Rob Fergus 2 1 New York University 2 Facebook AI Research Denotes equal

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Distirbutional robustness, regularizing variance, and adversaries

Distirbutional robustness, regularizing variance, and adversaries Distirbutional robustness, regularizing variance, and adversaries John Duchi Based on joint work with Hongseok Namkoong and Aman Sinha Stanford University November 2017 Motivation We do not want machine-learned

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

AdaGAN: Boosting Generative Models

AdaGAN: Boosting Generative Models AdaGAN: Boosting Generative Models Ilya Tolstikhin ilya@tuebingen.mpg.de joint work with Gelly 2, Bousquet 2, Simon-Gabriel 1, Schölkopf 1 1 MPI for Intelligent Systems 2 Google Brain Radford et al., 2015)

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Generative Adversarial Networks, and Applications

Generative Adversarial Networks, and Applications Generative Adversarial Networks, and Applications Ali Mirzaei Nimish Srivastava Kwonjoon Lee Songting Xu CSE 252C 4/12/17 2/44 Outline: Generative Models vs Discriminative Models (Background) Generative

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Learning to Sample Using Stein Discrepancy

Learning to Sample Using Stein Discrepancy Learning to Sample Using Stein Discrepancy Dilin Wang Yihao Feng Qiang Liu Department of Computer Science Dartmouth College Hanover, NH 03755 {dilin.wang.gr, yihao.feng.gr, qiang.liu}@dartmouth.edu Abstract

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Predicting Deeper into the Future of Semantic Segmentation Supplementary Material

Predicting Deeper into the Future of Semantic Segmentation Supplementary Material Predicting Deeper into the Future of Semantic Segmentation Supplementary Material Pauline Luc 1,2 Natalia Neverova 1 Camille Couprie 1 Jakob Verbeek 2 Yann LeCun 1,3 1 Facebook AI Research 2 Inria Grenoble,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Open Set Learning with Counterfactual Images

Open Set Learning with Counterfactual Images Open Set Learning with Counterfactual Images Lawrence Neal, Matthew Olson, Xiaoli Fern, Weng-Keen Wong, Fuxin Li Collaborative Robotics and Intelligent Systems Institute Oregon State University Abstract.

More information

arxiv: v7 [cs.lg] 27 Jul 2018

arxiv: v7 [cs.lg] 27 Jul 2018 How Generative Adversarial Networks and Their Variants Work: An Overview of GAN Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo and Sungroh Yoon Department of Electrical & Computer Engineering Seoul National University,

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Composite Functional Gradient Learning of Generative Adversarial Models. Appendix

Composite Functional Gradient Learning of Generative Adversarial Models. Appendix A. Main theorem and its proof Appendix Theorem A.1 below, our main theorem, analyzes the extended KL-divergence for some β (0.5, 1] defined as follows: L β (p) := (βp (x) + (1 β)p(x)) ln βp (x) + (1 β)p(x)

More information

First Order Generative Adversarial Networks

First Order Generative Adversarial Networks Calvin Seward 1 2 Thomas Unterthiner 2 Urs Bergmann 1 Nikolay Jetchev 1 Sepp Hochreiter 2 Abstract GANs excel at learning high dimensional distributions, but they can update generator parameters in directions

More information

Understanding GANs: the LQG Setting

Understanding GANs: the LQG Setting Understanding GANs: the LQG Setting Soheil Feizi 1, Changho Suh 2, Fei Xia 1 and David Tse 1 1 Stanford University 2 Korea Advanced Institute of Science and Technology arxiv:1710.10793v1 [stat.ml] 30 Oct

More information

CONTINUOUS-TIME FLOWS FOR EFFICIENT INFER-

CONTINUOUS-TIME FLOWS FOR EFFICIENT INFER- CONTINUOUS-TIME FLOWS FOR EFFICIENT INFER- ENCE AND DENSITY ESTIMATION Anonymous authors Paper under double-blind review ABSTRACT Two fundamental problems in unsupervised learning are efficient inference

More information

Machine Learning Summer 2018 Exercise Sheet 4

Machine Learning Summer 2018 Exercise Sheet 4 Ludwig-Maimilians-Universitaet Muenchen 17.05.2018 Institute for Informatics Prof. Dr. Volker Tresp Julian Busch Christian Frey Machine Learning Summer 2018 Eercise Sheet 4 Eercise 4-1 The Simpsons Characters

More information

Differentiable Fine-grained Quantization for Deep Neural Network Compression

Differentiable Fine-grained Quantization for Deep Neural Network Compression Differentiable Fine-grained Quantization for Deep Neural Network Compression Hsin-Pai Cheng hc218@duke.edu Yuanjun Huang University of Science and Technology of China Anhui, China yjhuang@mail.ustc.edu.cn

More information

Generative models for missing value completion

Generative models for missing value completion Generative models for missing value completion Kousuke Ariga Department of Computer Science and Engineering University of Washington Seattle, WA 98105 koar8470@cs.washington.edu Abstract Deep generative

More information

Deep Learning Year in Review 2016: Computer Vision Perspective

Deep Learning Year in Review 2016: Computer Vision Perspective Deep Learning Year in Review 2016: Computer Vision Perspective Alex Kalinin, PhD Candidate Bioinformatics @ UMich alxndrkalinin@gmail.com @alxndrkalinin Architectures Summary of CNN architecture development

More information

Introduction to Deep Neural Networks
