arxiv: v1 [stat.ml] 14 May 2015

Size: px

Start display at page:

Download "arxiv: v1 [stat.ml] 14 May 2015"

Ezra Curtis
5 years ago
Views:

1 Training generative neural networks via aximum ean Discrepancy optimization Gintare Karolina Dziugaite University of Cambridge Daniel. Roy University of Toronto Zoubin Ghahramani University of Cambridge arxiv: v [stat.l] 4 ay 05 Abstract We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. []. We compare to the adversarial nets framework introduced by Goodfellow et al. [], in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the D statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical D. Introduction In this paper, we consider the problem of learning generative models from i.i.d. data with unknown distribution P. We formulate the learning problem as one of finding a function G, called the generator, such that, given an input Z drawn from some fixed noise distribution N, the distribution of the output GZ) is close to the data s distribution P. Note that, given G and N, we can easily generate new samples despite not having an explicit representation for the underlying density. We are particularly interested in the case the generator is a deep neural network whose parameters we must learn. Rather than being used to classify or predict, these networks transport input randomness to output randomness, thus inducing a distribution. The first direct instantiation of this idea is due to ackay [7], although ackay draws connections even further back to the work of Saund [] and others on autoencoders, suggesting that generators can be understood as decoders. ackay s proposal, called density networks, uses multi-layer perceptrons LP) as generators and learns the parameters by approximating Bayesian inference. Since ackay s proposal, there has been a great deal of progress on learning generative models, especially over highdimensional spaces like images. Some of the most successful approaches have been based on restricted Boltzmann machines [0] and deep Boltzmann networks [3]. A recent example is the Neural Autoregressive Density Estimator due to Uria, urray, and Larochelle [5]. An indepth survey, however, is beyond the scope of this article. This work builds on a proposal due to Goodfellow et al. []. Their adversarial nets framework takes an indirect approach to learning deep generative neural networks: a discriminator network is trained to recognize the difference between training data and generated samples, while the generator is trained to confuse the discriminator. The resulting two-player game is cast as a minimax optimization of a differentiable objective and solved greedily by iteratively performing gradient descent steps to improve the generator and then the discriminator. Given the greedy nature of the algorithm, Goodfellow et al. [] give a careful prescription for balancing the training of the generator and the discriminator. In particular, two gradient steps on the discriminator s parameters are taken for every iteration of the generator s parameters. It is not clear at this point how sensitive this balance is as the data set and network vary. In this paper, we describe an approximation to adversarial learning that replaces the adversary with a closed-form nonparametric twosample test statistic based on the aximum ean Discrepancy D), which we adopted from the kernel two sample test []. We call our proposal D nets. We give bounds on the estimation error incurred by optimizing an empirical estimator rather than the true population D and give some illustrations on synthetic and real data. Learning to sample as optimization It is well known that, for any distribution P and any continuous distribution N on sufficiently regular spaces X and W, respectively, there is a function G : W X, such that GW ) P when W N. See, e.g., [4, Lem. 3.].) In other words, we can transform an input from a fixed input distribution N through a deterministic function, producing an output whose distribution is P. For a given family {G θ } of functions W X, called generators, we can cast the problem of learning a generative model In independent work reported in a recent preprint, Li, Swersky, and Zemel [6] also propose to use D as a training objective for generative neural networks. We leave a comparison to future work.

2 6 4 w m,. w m,d y m, 3 G D P y m,d input came from data not generator 3.5 squared D Before Before Training true data generated data true data generated data After Training After Training true data generated data 6 4 w m,. w m,d y m, 3 G y m,d kernel two-sample test statistic mu sigma pdf 4 pdf pdf Initialized Iter = 0 Iter = 50 Iter = 80 Trained. Iter = Convergence of Training Error 0.6 D Training 0. Generated Iteration Figure : top left) Comparison of adversarial nets and D nets. top right) Here we present a simple one-dimensional illustration of optimizing a generator via D. Both the training data and noise data are Gaussian distributed and we consider the class of generators given by G µ,σ) w) = µ + σw. The plot on the left shows the isocontours of the D-based cost function and the path taken by gradient descent. On right, we show the distribution of the generator before and after a number of training iterations, as compared with the data generating distribution. Here we did not resample the generated points and so we do not expect to be able to drive the D to zero and match the distribution exactly. bottom) The same procedure is repeated here for a two-dimensional dataset. On the left, we see the gradual alignment of the Gaussian-distributed input data to the Gaussian-distributed output data as the parameters of the generator G θ are optimized. The learning curve on the right shows the decrease in D obtained via gradient descent. as an optimization arg min θ δp, G θ N )), ) δ is some measure of discrepancy and G θ N ) is the distribution of G θ W ) when W N. In practice, we only have i.i.d. samples X, X,... from P, and so we optimize an empirical estimate of δp, G θ N )).. Adversarial nets Adversarial nets [] can be cast within this framework: Let {D φ } be a family of functions X [0, ], called discriminators. We recover the adversarial nets objective with the discrepancy δ AN P, G θ N )) = max φ E[ log D φ X) + log D φ Y )) ], X P and Y G θ N ). In this case, Eq. ) becomes min max V G θ, D φ ) θ φ V G θ, D φ ) = E [ log D φ X) + log D φ G θ W ))) ] for X P and W N. The output of the discriminator D φ can be interpreted as the probability it assigns to its input being drawn from P, and so V G θ, D φ ) is the expected log loss incurred when classifying the origin of a point equally likely to have been drawn from P or G θ N ). Therefore, optimizing φ maximizes the probability of distinguishing samples from P and G θ N ). Assuming that the optimal discriminator exists for every θ, the optimal generator G is that whose output distribution is closest to P, as measured by the Jensen Shannon divergence, which is minimized when G θ N ) = P. In [], the generators G θ and discriminators D φ are chosen to be multilayer perceptrons LP). In order to find a minimax solution, they propose taking alternating gradient steps along D φ and G θ. Note that the composition D φ G θ )) that appears in the value function is yet another larger) LP. This fact permits the use of the back-propagation algorithm to take gradient steps.. D as an adversary In their paper introducing adversarial nets, Goodfellow et al. [] remark that a balance must be struck between optimizing the generator and optimizing the discriminator. In particular, the authors suggest k maximization steps for every one minimization step to ensure that D φ is well synchronized with G θ during training. A large value for k, however, can lead to overfitting. In their experiments, for every step taken along the gradient with respect to G θ, they take two gradient steps with respect to D φ to bring D φ closer to the desired optimum Goodfellow, pers. comm.). It is unclear how sensitive this balance is. Regardless, while adversarial networks deliver impressive sampling performance, the optimization takes approximately 7.5 hours to train on the NIST dataset running on a nvidia GeForce GTX TITAN GPU with 6GB RA. Can we potentially speed up the process with a more tractable choice of adversary?

3 Our proposal is to replace the adversary with the kernel twosample test introduced by Gretton et al. []. In particular, we replace the family of discriminators with a family H of test functions X R, closed under negation, and use the maximum mean discrepancy between P and G θ N ) over H, given by δ DH P, G θ N )) = sup E[fX)] E[fY )], ) f H X P and Y G θ N ). See Fig. for a comparison of the architectures of adversarial and D nets. While Eq. ) involves a maximization over a family of functions, Gretton et al. [] show that it can be solved in closed form when H is a reproducing kernel Hilbert space RKHS). ore carefully, let H be a reproducing kernel Hilbert space RKHS) of real-valued functions on Ω and let, H denote its inner product. By the reproducing property it follows that there exists a reproducing kernel k H such that every f H can be expressed as fx) = f, k, x) H = α i kx, x i ) 3) The functions induced by a kernel k are those functions in the closure of the span of the set {k, x) : x Ω}, which is necessarily an RKHS. Note, that for every positive definite kernel there is a unique RKHS H such that every function in H satisfies Eq. 3). Assume that X is a nonempty compact metric space and F a class of functions f : X R. Let p and q be Borel probability measures on X, and let X and Y be random variables with distribution p and q, respectively. The maximum mean discrepancy D) between p and q is DF, p, q) = sup E[fX)] E[fY )] 4) If F is chosen to be an RKHS H, then D F, p, q) = µ p µ q H 5) µ p H is the mean embedding of p, given by µ p = kx, ) pdx) H 6) and satisfying, for all f H, X E[fX)] = f, µ p H. The properties of DH,, ) depend on the underlying RKHS H. For our purposes, it suffices to say that if we take X to be R D and consider the RKHS H induced by Gaussian or Laplace kernels, then D is a metric, and so the minimum of our learning objective is achieved uniquely by P, as desired. For more details, see Sriperumbudur et al. [].) In practice, we often do not have access to p or q. Instead, we are given independent i.i.d. data X, X, X,..., X N and Y, Y, Y,..., Y fom p and q, respectively, and would like to estimate the D. Gretton et al. [] showed that D [H, p, q] = E[kX, X ) kx, Y ) + ky, Y )] 7) Algorithm Stochastic gradient descent for D nets. Initialize, θ, α, k Randomly divide training set X into N mini mini batches for i, number-of-iterations do Regenerate noise inputs {w i } i=,..., every r iterations for n mini, N mini do for m, do y m G θ w m ) end for compute the n th minibatch s gradient C n) update learning rate α e.g., RSPROP) θ θ α C n end for end for and then proposed an unbiased estimator D u[h, X, Y ] = 3 D Nets NN ) + ) N kx n, x n ) n n ky m, y m ) m m N kx n, y m ). m= n= With an unbiased estimator of the D objective in hand, we can now define our proposal, D nets: Fix a neural network G θ, θ represents the parameters of the network. Let W = w,..., w ) denote noise inputs drawn from N, let Y θ = y,..., y m ) with y j = G θ w j ) denote the noise inputs transformed by the network G θ, and let X = x,.., x N ) denote the training data in R D. Given a positive definite kernel k on R D, we minimize CY θ, X) as a function of θ, CY θ, X) = ) N ky m, y m ) m m N ky m, x n ). m= n= Note that CY θ, X) is comprised of only those parts of the unbiased estimator that depend on θ. In practice, the minimization is solved by gradient descent, possibly on subsets of the data. ore carefully, the chain rule gives us CY θ, X) = N N n= m= 8) 9) C n Y θ, X n ) G θ w m ), 0) y m θ C n Y θ, X n ) = ky m, y m ) ) m m ky m, x n ). m= )

4 Each derivative CnY θ,x n) y m is easily computed for standard kernels like the RBF kernel. Our gradient CY θ, X n ) depends on the partial derivatives of the generator with respect to its parameters, which we can compute using back propagation. 4 Generalization bounds for D D nets operate by minimizing an empirical estimate of the D. This estimate is subject to onte Carlo error and so the network weights parameters) ˆθ that are found to minimize the empirical D may do a poor job at minimizing the exact population D. We show that, for sufficiently large data sets, this estimation error is bounded, despite the space of parameters θ being continuous and high dimensional. Let Θ denote the space of possible parameters for the generator G θ, let N be the distribution on W for the noisy inputs, and let p θ = G θ N ) be the distribution of G θ W ) when W N for θ Θ. Let ˆθ be the value optimizing the unbiased empirical D estimate, i.e., D uh, X, Yˆθ) = inf θ D uh, X, Y θ ), ) and let θ be the value optimizing the population D, i.e., D H, p data, p θ ) = inf θ D H, p data, p θ ). 3) We are interested in bounding the difference D H, p data, pˆθ) D H, p data, p θ ). 4) To that end, for a measured space X, write L X ) for the space of essentially bounded functions on X and write BL X )) for the unit ball under the sup norm, i.e., BL X )) = {f : X R : x X )fx) [, ]}. The bounds we obtain will depend on a notion of complexity captured by the fat-shattering dimension: Definition Fat-shattering [8]). Let X N = {x,..., x N } X and F BL X )). For every ε > 0, X N is said to be ε-shattered by F if there is some function h : X R, such that for every I {,..., N} there is some f I F for which f I x n ) hx n ) + ε if n I, 5) f I x n ) hx n ) ε if n / I. 6) For every ε, the fat-shattering dimension of F, written fat ε F), is defined as fat ε F) = sup { X N : X N X, X N is ε-shattered by F} We then have the following bound on the estimation error: Theorem estimation error). Assume the kernel is bounded by one. Define and G k+ = {g = kg θ w), G θ )) : w W, θ Θ} 7) G X k+ = {g = kx, G θ )) : x X, θ Θ}. 8) Assume there exists γ, γ > and p, p N such that, for all ε > 0, it holds that fat ε G k+ ) γ ε p and fat ε Gk+ X ) γ ε p. Then with probability at least δ, with D H, p data, pˆθ) < D H, p data, p θ ) + ε, 9) ε = rp, γ, ) + rp, γ, ) + the rate rp, γ, N) is if p <, rp, γ, ) = C p γ log 3 ) if p =, p if p >, log δ, for constants C p and C p depending on p and p alone. 0) ) The proof appears in the appendix. We can obtain simpler, but slightly more restrictive, hypotheses if we bound the fatshattering dimension of the class of generators {G θ : θ Θ} alone: Take the observation space X to be a bounded subset of a finite-dimensional Euclidean space and the kernel to be Lipschitz continuous and translation invariant. For the RBF kernel, the Lipschitz constant is proportional to the inverse of the length-scale: the resulting bound loosens as the length scale shrinks. 5 Empirical evaluation In this section, we demonstrate the approach on an illustrative synthetic example as well as the standard NIST digits and Toronto Face Dataset TFD) benchmarks. We show that Dbased optimization of the generator rapidly delivers a generator that performs well in maximizing the density of a held-out test set under a kernel-density estimator. 5. Gaussian data, kernel, and generator Under an RBF kernel and Gaussian generator with parameters θ = {µ, σ}, it is straightforward to find the gradient of CY θ, X) by applying the chain rule. Using fixed random standard normal numbers {w,..., w }, we have y m = µ + σw m for m {,.., }. The result of these illustrative synthetic experiments can be found in Fig.. The dataset consisted of N = 00 samples from a standard normal and = 50 noise input samples were generated from a standard normal with a fixed random seed. The algorithm was initialized at values {µ, σ} = {.5, 0.}. We fixed the learning rate to 0.5 and ran gradient descent steps for K = 50 iterations. 5. NIST digits We trained our generative network on the NIST digits dataset [5]. The generator was chosen to be a fully connected, 3 hidden layers neural network with sigmoidal activation functions. Following Gretton et al. [], we used a radial basis function RBF) kernel, but also evaluated the rational quadratic RQ) kernel [9] and Laplacian kernel, but found that the RBF performed best

5 Figure : top-left) NIST digits from the training set. top-right) Newly generated digits produced after,000,000 iterations approximately 5 hours). Despite the remaining artifacts, the resulting kernel-density estimate of the test data is state of the art. top-center) Newly generated digits after 300 further iterations optimizing the associated empirical D. bottom-left) D learning curves for first 000 iterations. bottom-right) D learning curves from 000 to 500,000 iterations. Note the difference in y-axis scale. No appreciable change is seen in later iterations. in the parameter ranges we evaluated. We used Bayesian optimization WHETLab) to set the bandwidth of the RBF and the number of neurons in each layer on initial test runs of 50,000 iterations. We used the median heuristic suggested by [] for the kernel two-sample test to choose the kernel bandwidth. The learning rate was adjusting during optimization by RSPROP [4]. Fig. presents the digits learned after,000,000 iterations. We performed minibatch stochastic gradient descent, resampling the generated digits every 300 iterations, using minibatches of size 500, with equal numbers of training and generated points. It is clear that the digits produced have many artifacts not appearing in the NIST data set. Despite this, the mean log density of the held-out test data is 35 ±, as compared with the reported 5 ± mean log density achieved by adversarial nets. There are several possible explanations for this. First, kernel density estimation is known to perform poorly in high dimensions. Second, the D objective can itself be understood as the squared difference of two kernel density estimates, and so, in a sense, the objective being optimized is directly related to the subsequent mean test log density evaluation. There is no clear connection for adversarial networks, which might explain why it suffers under this test. Our experience suggests that the RBF kernel delivers base line performance but that an imagespecific kernel, capturing, e.g., shift invariance, might lead to better images. 5.3 Toronto face dataset We have also trained the generative D network on Toronto face dataset TFD) [3]. The parameters were adapted from the NIST experiment: we also used a 3-hidden-layer sigmoidal LP with similar architecture 000, 600, and 000 units) and RBF kernel for the cost function with the same hyper parameter. The training dataset batch sizes were equal to the number of generated points 500). The generated points were resampled every 500 iterations. The network was optimized for 500,000 iterations. The samples from the resulting network are plotted in Fig. 3. The mean log density of the held-out test set is 83 ± 39. Although this figure is higher than the mean log density of 057 ± 6 reported in [], the samples from the D network are again clearly distinguishable from the training dataset. Thus the high test score suggests that kernel density estimation does not perform well at evaluating the performance for these high dimensional datasets. 6 Conclusion D offers a closed form surrogate for the discriminator in adversarial nets framework. After using Bayesian optimization for the parameters, we found that the network outperformed the adversarial network in terms of the density of the held-out test set under kernel density estimation. On the other hand, there is a clear discrepancy between the digits produced by D Nets and the NIST digits, which might suggest that KDE is not up to the task of evaluating these models. Given how quickly

6 Figure 3: left) TFD. right) Faces generated by network trained for 500,000 iterations. center) Generated points after 500 iterations. D Nets achieves this level of performance, it is worth considering its use as an initialization for more costly procedures. only, such that 3 RN F) Cp γ log N p N Acknowledgments The authors would like to thank Bharath Sriperumbudur for technical discussions. A Proofs Theorem 4 []). Assume 0 kxi, xj ) K, = N. Then Pr Du H, X, Yθ ) D H, pdata, pθ ) > ε δε 3) ε δε = exp. 6K Definition [8]). A random variable σ is said to be a Rademacher random variable if it takes values in {, }, each with probability /. Definition 3 [8]). Let µ be a probability measure on X, and let F be a class of uniformly bounded functions on X. Then the Rademacher complexity of F with respect to µ) is RN F) = Eµ Eσ,...,σN N X sup σn f Xn ) N f F n= #, σ = σ, σ,... ) is a sequence of independent Rademacher random variables, and X, X,... are independent, µ-distributed random variables, independent also from σ. Theorem cdiarmids Inequality [8]). Let f : X XN R and assume there exists c,..., cn 0 such that, for all k {,..., N }, we have sup x,...,xk,x0k,...,xn f x,..., xk,..., xn ) f x,..., x0k,..., xn ) ck. Then, for all ε > 0 and independent random variables ξ,..., ξn in X, Pr {f ξ,..., ξn ) Ef ξ,..., ξn )) ε)}! ε < exp PN. n= cn ) We begin with some preliminaries and known results: " if 0 < p < if p = if p >. Theorem 3 [8, Thm..35]). Let F BL X )). Assume there exists γ >, such that for all ε > 0, fatε F) γεp for some p N. Then there exists constants Cp depending on p 4) The case Θ is a finite set is elementary: Theorem 5 estimation error for finite parameter set). Let pθ be the distribution of Gθ W ), with θ taking values in some finite set Θ = {θ,..., θt }, T <. Then, with probability at least T + )δε, δε is defined as in Theorem 4, we have D H, pdata, pθ ) < D H, pdata, pθ ) + ε. 5) Proof. Let Eθ) = Du H, X, Yθ ) 6) T θ) = D H, pdata, pθ ). 7) and Note, that the upper bound stated in Theorem 4 holds for the parameter value θ, i.e., Pr [ Eθ ) T θ ) > ε] δε. 8) Because θ depends on the training data X and generator data Y, we use a uniform bound that holds over all θ. Specifically, h i Pr Eθ ) T θ ) > ε Pr sup Eθ) T θ) > ε θ T X t= h i Pr Eθ ) T θ ) > ε T δε.

7 This yields that with probability at least T δ ε, ε Eˆθ) T ˆθ) + Eθ ) T θ ) 9) Eθ ) Eˆθ) + T ˆθ) T θ ). 30) Since θ was chosen to minimize T θ), we know that Similarly, by Eq. ), Therefore it follows that T ˆθ) T θ ). 3) Eθ ) Eˆθ). 3) ε T ˆθ) T θ ) 33) = D H, p data, p θ ) D H, p data, pˆθ) 34) proving the theorem. Corollary. With probability at least δ, D H, p data, pˆθ) < D H, p data, p θ ) + ε δ, ε δ = 8K log [T + )δ]. In order to prove the general result, we begin with some technical lemmas. The development here owes much to Gretton et al. []. Lemma. Let F = {f : Y Y R} and F + = {h = fy, ) : f F, y Y} BL Y)). Let {Y n } N n= be µ-distributed independent random variables in Y. Assume for some γ > and some p N, we have fat ε F + ) γε p, for all ε > 0. For y n Y n =,..., N, define ρy,..., y N ) to be supe fy, Y )) fy n, y n ). NN ) n n Then there exists a constant C that depends on p, such that N if p < E ρy,..., Y N )) Cγ log 3 N) N if p = if p >. N) p Proof. Let us introduce {ζ n } N n=, ζ n and Y n have the same distribution and are independent for all n, n {,..., N}. Then the following is true: EfY, Y ) )) = E fζ n, ζ n ) NN ) n,n : n n Using Jensen s inequality and the independence of Y, Y and Y n, Y n, we have E ρy,..., Y N )) 35) = E sup EfY, Y )) ) 36) fy m, Y m ) NN ) n n E sup fζ n, ζ NN ) n) n n ) 37) fy n, Y n ) NN ). n n Introducing conditional expectations allows us to rewrite the equation with the sum over n outside the expectations. I.e., Eq. 36) equals to N E E Yn,ζn) ) ) sup fζ n, ζ n ) fy n, Y n )) n N n n 38) =E E Y,ζ) N ) ) sup σ n fζ, ζ n ) fy, Y n )). N n= 39) The second equality follows by symmetry of random variables {ζ n } N n=. Note that we also added Rademacher random variables {σ n } N n= before each term in the sum since fζ n, ζ n ) fy n, Y n )) has the same distribution as fζ n, ζ n ) fy n, Y n )) for all n, n and therefore the σ s do not affect the expectation of the sum. Note that ζ m and Y m are identically distributed. Thus the triangle inequality implies that Eq. 39) is less than or equal to E N E Y ) sup N σ n fy, Y n ) )) n= N R N F + ), R N F + ) is the Rademacher s complexity of F +. Then by Theorem 3, we have N if p < E ρy,..., Y N )) Cγ log 3 N) N if p = if p >. N) p Lemma. Let F = {f : X Y R} and 40) F + = {f : x Y R, x X }. 4) and F + BL Y)). Let {X n } N n= and {Y m } n= be ν- and µ-distributed independent random variables in X and Y, respectively. Assume for some γ >, such that for all ε > 0,

8 fat ε F + ) γε p, for some p N. For all x n X, n N, and all y m Y, m, define ρx,..., x N, y,..., y ) = supefx, Y ) fx n, y m ). N n,m Then there exists C that depends on p, such that E ρx,..., X N, Y,..., Y )) if p < Cγ log 3 ) if p = if p >. ) p Proof. The proof is very similar to that of Lemma. Proof of Theorem. The proof follows the same steps as the proof of Theorem 5 apart from a stronger uniform bound stated in Eq. 9). I.e., we need to show: [ ] Pr sup Eθ) T θ) ε δ. 4) Expanding D as defined by Eqs. 7) and 8), and substituting Y = G θ W ), yields sup Eθ) T θ) 43) = sup EkX, X )) NN ) kx n, X n ) n n + EkG θ W ), G θ W ))) kg θ W m ), G θ W m )) ) m m EkX, G θ W ))) + kx n, G θ W m )). N m,n 44) For all n {,..., N}, kx n, X n ) does not depend on θ and therefore the first two terms of the equation above can be taken out of the supremum. Also, note that since k, ) K, we have ζx,..., x n,..., x N ) ζx,..., x n,..., x N ) K N, ζx,..., x N ) = NN ) n,n : n n kx n, x n ), and ζ is an unbiased estimate of EkX, X )). Then from c- Diarmid s inequality on ζ, we have EkX, Pr X ) )) kx n, X n ) ε NN ) n n ) exp ε K N. 45) Therefore Eq. 44) is bounded by the sum of the bound on Eq. 45) and the following: sup EkG θ W ), G θ W ))) ) m m kg θ W m ), G θ W m )) EkX, G θ W ))) + kx n, G θ W m )). N m,n 46) Thus the next step is to find the bound for the supremum above. Define and fw,..., W ; p noise ) = fw ) = sup EkG θ W ), G θ W ))) kg θ W m ), G θ W m )) ) m m hx,..., X N, W,..., W ; p data, p noise ) = hx N, W ) = sup kx n, G θ W m )) EkX, G θ W ))). N m,n Then by triangle inequality Eq. 46) fw ) + hx N, W ). 47) We will first find the upper bound on fw ), i.e., for every ε > 0, we will show that there exists δ f, such that Pr fw ) > ε) δ f 48) For each m {,..., }, fw,..., W m,..., W ) 49) fw,..., W m,..., W ) K 50) since the kernel is bounded by K, and therefore kg θ W m ), G θ W m )) is bounded by K for all m. The conditions of Theorem are satisfied and thus we can use cdiarmids Inequality on f: ) Pr fw ) EfW )) ɛ) exp ε K. 5) Define G k = {kg θ ), G θ )) : θ Θ} 5) To show Eq. 48), we need to bound the expectation of f. We can apply Lemma on the function classes G k and G k+. The resulting bound is if p < EfW )) ε p = C f γ log 3 ) if p = ) p if p >., 53)

9 p and γ are parameters associated to fat shattering dimension of G k+ as stated in the assumptions of the theorem, and C f is a constant depending on p. Now we can write down the bound on f: ) Pr fw ) ε p + ɛ) exp ε K = δ f. 54) Similarly, hx N, W ) has bounded differences: hx,..., X n,..., X N, W,..., W ) hx,..., X n,..., X N, W,..., W ) K N 55) and hx,..., X N, W,..., W m,..., W ) hx,..., X N, W,..., W m,..., W ) K 56). cdiarmid s inequality then implies Pr hx N, W ) EhX N, W ) ε) exp ε K N N + ). 57) We can bound expectation of hx N, W ) using Lemma applied on G X k and GX k+, Then G X k = {k, G θ )) : θ Θ}. 58) EhX N, W )) ε p = C h γ if p < log 3 ) if p = p if p >. 59) for some constant C h that depends on The final bound on h is then Pr hx N, W ) ε p + ε) exp ε K ) N = δ h. N + 60) Summing up the bounds from Eq. 54) and Eq. 57), it follows that Pr fw ) + hx N, W ) ε p + ε p + 3ε) maxδ f, δ h ) = δ h. 6) Using the bound in Eq. 45), we have obtain the uniform bound we were looking for: [ ] Pr sup Eθ) T θ) > ε p + ε p + 4ε δ h, 6) which by Eq. 9) yields [ ] Pr Eˆθ) T ˆθ) > ε p + ε p + 4ε δ h. 63) Since it was assumed that K = and N =, we get ) δ h = exp ε. 64) 4 To finish, we proceed as in the proof of Theorem 5. We can rearrange some of the terms to get a different form of Eq. 8): Pr [ Eθ ) T θ ) > ε] ) exp ε = δ h. 4 All of the above implies that for any ε > 0, there exists δ, such that Pr D H, p data, pˆθ) D H, p data, p θ ) ε ) δ, We can rewrite ε as: ε = ε p + ε p + log δ. 65) ε = rp, γ, ) + rp, γ, ) + The rate rp, γ, N) is given by Eq. 53) and Eq. 59): if p <, rp, γ, ) = C p γ log 3 ) if p =, p if p >, log δ, 66) 67) the constants C p and C p depend on p and p alone. We close by noting that the approximation error is zero in the nonparametric limit. Theorem 6 Gretton et al. []). Let F be the unit ball in a universal RKHS H, defined on the compact metric space X, with associated continuous kernel k, ). Then D[H, p, q] = 0 if and only if p = q. Corollary approximation error). Assume p data is in the family {p θ } and that H is an RKHS induced by a characteristic kernel. Then inf θ DH, p data, p θ ) = 0 68) and the infimum is achieved at θ satisfying p θ = p data. Proof. By Theorem 6, it follows that D H,, ) is a metric. The result is then immediate.

10 References [] I. J. Goodfellow, J. Pouget-Abadie,. irza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. 04. [] A. Gretton, K.. Borgwardt,. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-sample Test. In: J. ach. Learn. Res. 3 ar. 0), pp [3] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. In: Science July 006), pp [4] O. Kallenberg. Foundations of modern probability. nd. New York: Springer, 00, pp. xx+638. [5] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE. 998, pp [6] Y. Li, K. Swersky, and R. Zemel. Generative oment atching Networks [7] D. J. ackay. Bayesian Neural Networks and Density Networks. In: Nuclear Instruments and ethods in Physics Research, A. 994, pp [8] S. endelson. A Few Notes on Statistical Learning Theory. English. In: Advanced Lectures on achine Learning. Ed. by S. endelson and A. Smola. Vol Lecture Notes in Computer Science. Springer Berlin Heidelberg, 003, pp. 40. [9] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for achine Learning Adaptive Computation and achine Learning). The IT Press, 005. [0] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann achines. In: Journal of achine Learning Research - Proceedings Track 5 009), pp [] E. Saund. Dimensionality-Reduction Using Connectionist Networks. In: IEEE Trans. Pattern Anal. ach. Intell ), pp [] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert Space Embeddings of Probability easures. In: Conf. Comp. Learn. Theory, COLT) [3] J.. Susskind, A. K. Anderson, and G. E. Hinton. The Toronto face database. Tech. rep. 00. [4] T. Tieleman and G. Hinton. Lecture 6.5 RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for achine Learning. 0. [5] B. Uria, I. urray, and H. Larochelle. A Deep and Tractable Density Estimator. In: CoRR abs/ ).

Training generative neural networks via Maximum Mean Discrepancy optimization

Training generative neural networks via Maximum Mean Discrepancy optimization Gintare Karolina Dziugaite University of Cambridge Daniel M. Roy University of Toronto Zoubin Ghahramani University of Cambridge