Topics about f-divergence
Presented by Liqun Chen, Mar 16th, 2018
Outline
1. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
   - Experiments
2. f-GANs in an Information Geometric Nutshell
   - Experiments
Part 1. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
Given a generative model Q from a class 𝒬 of possible models, we are generally interested in performing one or more of the following operations:
- Sampling. Produce a sample from Q. By inspecting samples, or by calculating a function on a set of samples, we can obtain important insight into the distribution or solve decision problems.
- Estimation. Given a set of i.i.d. samples {x_1, x_2, ..., x_n} from an unknown true distribution P, find the Q ∈ 𝒬 that best describes the true distribution.
- Point-wise likelihood evaluation. Given a sample x, evaluate the likelihood Q(x).
Generative adversarial networks (GANs) allow sampling and estimation.
In the original GAN paper, the symmetric Jensen-Shannon divergence is used:

D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right),   (1)

where D_{KL} denotes the Kullback-Leibler divergence. In this work the authors show that the principle of GANs is more general: the variational divergence estimation framework can be extended to recover the GAN training objective and to generalize it to arbitrary f-divergences.
Contributions of this paper:
- They derive the GAN training objective for all f-divergences and provide additional divergence functions as examples, including the Kullback-Leibler and Pearson divergences.
- They simplify the saddle-point optimization procedure of the original GAN paper and provide a theoretical justification for it.
- They provide experimental insight into which divergence functions are suitable for estimating generative neural samplers for natural images.
The f-divergence family

Given two distributions P and Q that possess, respectively, absolutely continuous density functions p and q with respect to a base measure dx defined on the domain \mathcal{X}, the f-divergence is defined as

D_f(P \| Q) = \int_{\mathcal{X}} q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx,   (2)

where the generator function f : \mathbb{R}_+ \to \mathbb{R} is a convex, lower-semicontinuous function satisfying f(1) = 0.
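To make definition (2) concrete, here is a minimal numerical sketch (my illustration, not from the paper; the distributions and helper name are hypothetical) that evaluates D_f for discrete P and Q. With f(u) = u log u it recovers the KL divergence, and with the Jensen-Shannon generator it matches the two-KL formula in (1).

import numpy as np

def f_divergence(p, q, f):
    # D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions, eq. (2).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(q * f(p / q))

# Kullback-Leibler: generator f(u) = u log u.
f_kl = lambda u: u * np.log(u)
# Jensen-Shannon: f(u) = (u log u - (u + 1) log((u + 1) / 2)) / 2.
f_js = lambda u: 0.5 * (u * np.log(u) - (u + 1) * np.log((u + 1) / 2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Matches the direct KL formula.
assert np.isclose(f_divergence(p, q, f_kl), np.sum(p * np.log(p / q)))

# Matches the two-KL form of the Jensen-Shannon divergence in eq. (1).
m = 0.5 * (p + q)
js_direct = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))
assert np.isclose(f_divergence(p, q, f_js), js_direct)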
Variational estimation of f-divergences

Every convex, lower-semicontinuous function f has a convex conjugate function f^*, also known as the Fenchel conjugate. This function is defined as

f^*(t) = \sup_{u \in \mathrm{dom}_f} \{ut - f(u)\}.   (3)

The function f^* is again convex and lower-semicontinuous, and the pair (f, f^*) is dual to one another in the sense that f^{**} = f. Therefore, we can also represent f as

f(u) = \sup_{t \in \mathrm{dom}_{f^*}} \{tu - f^*(t)\}.
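As a quick worked example (mine, not from the slides), take the Kullback-Leibler generator f(u) = u \log u. Then

f^*(t) = \sup_{u > 0} \{ut - u\log u\}, \qquad \frac{d}{du}(ut - u\log u) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{t-1},

so f^*(t) = e^{t-1}(t - (t - 1)) = e^{t-1}. Applying conjugation once more recovers the generator: \sup_t \{tu - e^{t-1}\} is attained at t = \log u + 1 and equals u \log u, illustrating f^{**} = f.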
Using the conjugate function, we can rewrite the f-divergence as

D_f(P \| Q) = \int_{\mathcal{X}} q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t \frac{p(x)}{q(x)} - f^*(t) \right\} dx
\;\geq\; \sup_{T \in \mathcal{T}} \left( \int_{\mathcal{X}} p(x) \, T(x) \, dx - \int_{\mathcal{X}} q(x) \, f^*(T(x)) \, dx \right)
= \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \right),   (4)

where \mathcal{T} is an arbitrary class of functions T : \mathcal{X} \to \mathbb{R}. The above derivation yields a lower bound for two reasons:
- Jensen's inequality when swapping the integration and supremum operations;
- the class of functions \mathcal{T} may contain only a subset of all possible functions.
By taking the variation of the lower bound in (4) w.r.t. T, we find that under mild conditions on f the bound is tight for

T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right).   (5)

For example, the popular reverse Kullback-Leibler divergence corresponds to f(u) = -\log u, resulting in T^*(x) = -q(x)/p(x).
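A small self-check (my own illustration, not from the paper): for the forward KL generator f(u) = u log u we have f'(u) = log u + 1 and f^*(t) = e^{t-1}, so plugging T^* = f'(p/q) into the bound (4) should recover KL(P‖Q) exactly, while any other T gives a strictly smaller value.

import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

f_star = lambda t: np.exp(t - 1.0)   # conjugate of f(u) = u log u
T_opt = np.log(p / q) + 1.0          # T*(x) = f'(p(x)/q(x)), eq. (5)

def lower_bound(T):
    # E_{x~P}[T(x)] - E_{x~Q}[f*(T(x))], the variational bound of eq. (4).
    return np.sum(p * T) - np.sum(q * f_star(T))

kl = np.sum(p * np.log(p / q))
assert np.isclose(lower_bound(T_opt), kl)   # bound is tight at T = T*
assert lower_bound(T_opt - 0.3) < kl        # a perturbed T gives a looser bound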
The f-GAN objective

We follow the generative-adversarial approach, using two neural networks, Q and T:
- Q is the generative model (generator), parametrized through a vector θ; we write Q_θ.
- T is the variational function (discriminator), taking a sample as input and returning a scalar, parametrized through a vector ω; we write T_ω.

We can learn a generative model Q_θ by finding a saddle point of the following f-GAN objective function, which we minimize with respect to θ and maximize with respect to ω:

F(\theta, \omega) = \mathbb{E}_{x \sim P}[T_\omega(x)] - \mathbb{E}_{x \sim Q_\theta}[f^*(T_\omega(x))].   (6)
To apply the variational objective (6) to different f-divergences, we need to respect the domain \mathrm{dom}_{f^*} of the conjugate function f^*. The authors assume that the variational function is represented in the form T_\omega(x) = g_f(V_\omega(x)) and rewrite the saddle objective (6) as

F(\theta, \omega) = \mathbb{E}_{x \sim P}[g_f(V_\omega(x))] + \mathbb{E}_{x \sim Q_\theta}[-f^*(g_f(V_\omega(x)))],   (7)

where V_\omega : \mathcal{X} \to \mathbb{R} has no range constraints on its output, and g_f : \mathbb{R} \to \mathrm{dom}_{f^*} is an output activation function specific to the f-divergence used.
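Below is a minimal PyTorch sketch of one alternating update of objective (7) (my illustration; the paper does not prescribe a framework, and the network sizes and batch here are hypothetical stand-ins). It uses the GAN-divergence pairing from the paper's table, g_f(v) = -log(1 + e^{-v}) and f^*(t) = -log(1 - e^t).

import torch
import torch.nn as nn
import torch.nn.functional as Fn  # aliased so it does not clash with the objective F

z_dim, x_dim, hidden = 16, 2, 64  # hypothetical sizes, for illustration only

generator = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, x_dim))       # Q_theta
V = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1))                   # V_omega, unconstrained output
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(V.parameters(), lr=1e-4)

def f_gan_objective(x_real, x_fake):
    # Eq. (7) for the GAN divergence: g_f(v) = -softplus(-v), f*(t) = -log(1 - e^t),
    # hence g_f(V) = -softplus(-V) and -f*(g_f(V)) = -softplus(V).
    real_term = -Fn.softplus(-V(x_real)).mean()  # E_{x~P}[g_f(V(x))]
    fake_term = -Fn.softplus(V(x_fake)).mean()   # E_{x~Q_theta}[-f*(g_f(V(x)))]
    return real_term + fake_term

x_real = torch.randn(128, x_dim)  # stand-in for a batch of real data

# Discriminator step: ascend F in omega.
x_fake = generator(torch.randn(128, z_dim)).detach()
loss_d = -f_gan_objective(x_real, x_fake)
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: descend F in theta (only the Q_theta term depends on theta;
# stale gradients on V are cleared by the next opt_d.zero_grad()).
x_fake = generator(torch.randn(128, z_dim))
loss_g = -Fn.softplus(V(x_fake)).mean()
opt_g.zero_grad()
loss_g.backward()
opt_g.step()

With this pairing, g_f(V) = log σ(V) and -f^*(g_f(V)) = log(1 - σ(V)), so (7) reduces to the original GAN objective up to constants.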
[Table omitted: for each f-divergence, the paper lists the generator f(u), its Fenchel conjugate f^*(t), the conjugate's domain, and the output activation g_f.]
Experiments

Kernel density estimate (KDE) log-likelihoods on MNIST digits:

Training divergence          KDE LL (nats) ± SEM
Kullback-Leibler               416  ±  5.62
Reverse Kullback-Leibler       319  ±  8.36
Pearson χ²                     429  ±  5.53
Neyman χ²                      300  ±  8.33
Squared Hellinger             -708  ± 18.1
Jeffrey                      -2101  ± 29.9
Jensen-Shannon                 367  ±  8.19
GAN                            305  ±  8.97
Variational Autoencoder        445  ±  5.36
KDE MNIST train (60k)          502  ±  5.99
Part 2. f-GANs in an Information Geometric Nutshell
We define two important classes of distortion measures.

Definition. For any two distributions P and Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the f-divergence between P and Q, where f : \mathbb{R}_+ \to \mathbb{R} is convex with f(1) = 0, is

I_f(P \| Q) = \mathbb{E}_{X \sim Q}\!\left[ f\!\left( \frac{P(X)}{Q(X)} \right) \right] = \int_{\mathcal{X}} Q(x) \, f\!\left( \frac{P(x)}{Q(x)} \right) d\mu(x).   (8)

For any convex differentiable ϕ : \mathbb{R}^d \to \mathbb{R}, the (ϕ-)Bregman divergence between θ and ϱ is

D_\phi(\theta \| \varrho) = \phi(\theta) - \phi(\varrho) - (\theta - \varrho)^\top \nabla\phi(\varrho),   (9)

where ϕ is called the generator of the Bregman divergence.
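As a quick illustration of (9) (mine, not from the paper): a generic Bregman divergence needs only the generator and its gradient. The generator ϕ(θ) = ½‖θ‖² yields half the squared Euclidean distance, and the negative-entropy generator yields the generalized KL divergence.

import numpy as np

def bregman(theta, rho, phi, grad_phi):
    # D_phi(theta || rho) = phi(theta) - phi(rho) - <theta - rho, grad phi(rho)>, eq. (9).
    return phi(theta) - phi(rho) - np.dot(theta - rho, grad_phi(rho))

theta = np.array([0.5, 0.3, 0.2])
rho = np.array([0.2, 0.5, 0.3])

# phi = 0.5 ||.||^2  ->  half the squared Euclidean distance.
sq = bregman(theta, rho, lambda t: 0.5 * np.dot(t, t), lambda t: t)
assert np.isclose(sq, 0.5 * np.sum((theta - rho) ** 2))

# phi = negative entropy  ->  generalized KL divergence.
neg_ent = lambda t: np.sum(t * np.log(t))
grad_neg_ent = lambda t: np.log(t) + 1.0
gkl = bregman(theta, rho, neg_ent, grad_neg_ent)
assert np.isclose(gkl, np.sum(theta * np.log(theta / rho) - theta + rho))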
The contributions of this paper:
- They show that f-GAN(P, escort(Q)) = D(θ‖ϑ) + Penalty(Q) for P and Q (with respective parameters θ and ϑ) that lie in a superset of exponential families called deformed exponential families.
- They develop a novel min-max game interpretation of the equation above.
χ-logarithms define generalizations of exponential families. Let χ : \mathbb{R}_+ \to \mathbb{R}_+ be non-decreasing. The χ-logarithm, log_χ, is defined as

\log_\chi(z) = \int_1^z \frac{1}{\chi(t)} \, dt.   (10)

The χ-exponential is

\exp_\chi(z) = 1 + \int_0^z \lambda(t) \, dt,   (11)

where λ is defined by \lambda(\log_\chi(z)) = \chi(z).
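For concreteness (my example, not from the slides): with the power deformation χ(t) = t^q, eq. (10) gives the Tsallis q-logarithm \log_\chi(z) = (z^{1-q} - 1)/(1 - q) for q ≠ 1, whose inverse is \exp_\chi(z) = (1 + (1-q)z)^{1/(1-q)}; at q = 1 both reduce to the ordinary log and exp. A quick numerical check:

import numpy as np
from scipy.integrate import quad

q = 0.7  # deformation parameter; q -> 1 recovers log/exp

chi = lambda t: t ** q
log_chi = lambda z: (z ** (1 - q) - 1) / (1 - q)        # closed form of eq. (10)
exp_chi = lambda z: (1 + (1 - q) * z) ** (1 / (1 - q))  # its inverse

z = 2.5
# The closed form agrees with the integral definition log_chi(z) = int_1^z dt / chi(t).
integral, _ = quad(lambda t: 1.0 / chi(t), 1.0, z)
assert np.isclose(log_chi(z), integral)
# exp_chi inverts log_chi.
assert np.isclose(exp_chi(log_chi(z)), z)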
Deformed exponential family

Definition. A distribution P from a χ-exponential family (or deformed exponential family, χ being implicit) with convex cumulant C : Θ \to \mathbb{R} and sufficient statistics φ : \mathcal{X} \to \mathbb{R}^d has density given by

P_{\chi,C}(x \mid \theta, \varphi) = \exp_\chi(\varphi(x)^\top \theta - C(\theta)),   (12)

with respect to a dominating measure µ. Here, Θ is a convex open set and θ is called the coordinate of P. The escort density (or χ-escort) of P_{χ,C} is

\tilde{P}_{\chi,C} = \frac{1}{Z} \, \chi(P_{\chi,C}),   (13)

where

Z = \int_{\mathcal{X}} \chi(P_{\chi,C}(x \mid \theta, \varphi)) \, d\mu(x)   (14)

is the escort's normalization constant.
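Since the escort only rescales the density pointwise and renormalizes, it is simple to compute in the discrete case. A minimal sketch (my illustration; the helper name is hypothetical), using the power deformation χ(t) = t^q from the previous example:

import numpy as np

def escort(p, chi):
    # chi-escort of a discrete density p: chi(p) / Z, eqs. (13)-(14).
    w = chi(p)
    return w / w.sum()

p = np.array([0.7, 0.2, 0.1])
tilde_p = escort(p, lambda t: t ** 0.5)  # Tsallis escort: p^q / sum(p^q)
assert np.isclose(tilde_p.sum(), 1.0)    # it is again a distribution

# chi(t) = t (no deformation) leaves p unchanged.
assert np.allclose(escort(p, lambda t: t), p)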
Theorem. For any two χ-exponential distributions P and Q with respective densities P_{χ,C}, Q_{χ,C} and coordinates θ, ϑ,

D_C(\theta \| \vartheta) = \mathbb{E}_{X \sim \tilde{Q}}[\log_\chi(Q_{\chi,C}(X)) - \log_\chi(P_{\chi,C}(X))],   (15)

where \tilde{Q} denotes the χ-escort of Q.

Definition. For any χ-logarithm and distributions P, Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the KL_χ divergence between P and Q is defined as

KL_\chi(P \| Q) = \mathbb{E}_{X \sim P}\!\left[ -\log_\chi\!\left( \frac{Q(X)}{P(X)} \right) \right].   (16)

Since χ is non-decreasing, −log_χ is convex, and so any KL_χ divergence is an f-divergence. (For χ(t) = t we get log_χ = log, and KL_χ reduces to the standard Kullback-Leibler divergence.)
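A quick numerical sanity check of definition (16) (my own, reusing the Tsallis log_χ from the earlier sketch): KL_χ is non-negative, and it reduces to the ordinary KL divergence when χ(t) = t.

import numpy as np

def kl_chi(p, q, log_chi):
    # KL_chi(P||Q) = E_{X~P}[-log_chi(Q(X)/P(X))], eq. (16), discrete case.
    return np.sum(p * -log_chi(q / p))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

qdef = 0.7  # Tsallis deformation chi(t) = t^qdef
log_chi = lambda z: (z ** (1 - qdef) - 1) / (1 - qdef)
assert kl_chi(p, q, log_chi) >= 0.0  # it is an f-divergence, hence non-negative

# chi(t) = t gives log_chi = log and recovers the standard KL divergence.
assert np.isclose(kl_chi(p, q, np.log), np.sum(p * np.log(p / q)))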
Theorem. Letting P = P_{χ,C} and Q = Q_{χ,C} for short in the last theorem, we have

\mathbb{E}_{X \sim \tilde{Q}}[\log_\chi(Q_{\chi,C}(X)) - \log_\chi(P_{\chi,C}(X))] = KL_{\chi_Q}(\tilde{Q} \| P) - J(Q),

with J(Q) = KL_{\chi_Q}(\tilde{Q} \| Q).

Theorem. KL_{\chi_Q}(\tilde{Q} \| P) admits the variational formulation

KL_{\chi_Q}(\tilde{Q} \| P) = \sup_{T \in (\mathbb{R}_{--})^{\mathcal{X}}} \left\{ \mathbb{E}_{X \sim P}[T(X)] - \mathbb{E}_{X \sim \tilde{Q}}[(-\log_{\chi_Q})^\star(T(X))] \right\},   (17)

with \mathbb{R}_{--} = \mathbb{R} \setminus \mathbb{R}_{+}. Furthermore, letting Z denote the normalization constant of the χ-escort of Q, the optimum T^* : \mathcal{X} \to \mathbb{R}_{--} of eq. (17) is

T^*(x) = -\frac{1}{Z} \cdot \frac{\chi(Q(x))}{\chi(P(x))}.   (18)
Corollary. Combining the theorems above,

\sup_{T \in (\mathbb{R}_{--})^{\mathcal{X}}} \left\{ \mathbb{E}_{X \sim P}[T(X)] - \mathbb{E}_{X \sim \tilde{Q}}[(-\log_{\chi_Q})^\star(T(X))] \right\} = D_C(\theta \| \vartheta) + J(Q),   (19)

where θ (resp. ϑ) is the coordinate of P (resp. Q). This is precisely the announced identity f-GAN(P, escort(Q)) = D(θ‖ϑ) + Penalty(Q).
Experiments

[Figure: MNIST results.]