Topics about f-divergence
Presented by Liqun Chen
Mar 16th, 2018

Outline
1 f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
  Experiments
2 f-GANs in an Information Geometric Nutshell
  Experiments

1 f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Given a generative model Q from a class \mathcal{Q} of possible models, we are generally interested in performing one or more of the following operations:
- Sampling. Produce a sample from Q. By inspecting samples or calculating a function on a set of samples we can obtain important insight into the distribution or solve decision problems.
- Estimation. Given a set of iid samples {x_1, x_2, ..., x_n} from an unknown true distribution P, find the Q \in \mathcal{Q} that best describes the true distribution.
- Point-wise likelihood evaluation. Given a sample x, evaluate the likelihood Q(x).
Generative adversarial networks (GANs) allow sampling and estimation.

In the original GAN paper, the symmetric Jensen-Shannon divergence is used:

D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{1}{2}(P + Q)\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{1}{2}(P + Q)\right), \qquad (1)

where D_{KL} denotes the Kullback-Leibler divergence. In this work the authors show that the principle of GANs is more general: the variational divergence estimation framework can be extended to recover the GAN training objective and to generalize it to arbitrary f-divergences.
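As a worked check (added here, not from the slides), D_{JS} is itself an f-divergence in the sense defined below, with generator

f(u) = \frac{u}{2}\log u - \frac{u+1}{2}\log\frac{u+1}{2}, \qquad f(1) = 0,

so the Jensen-Shannon objective of the original GAN is one member of the family treated next.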

Contributions of this paper:
- They derive the GAN training objectives for all f-divergences and provide as examples additional divergence functions, including the Kullback-Leibler and Pearson divergences.
- They simplify the saddle-point optimization procedure of the original GAN paper and provide a theoretical justification.
- They provide experimental insight into which divergence function is suitable for estimating generative neural samplers for natural images.

f-divergence family
Given two distributions P and Q that possess, respectively, absolutely continuous density functions p and q with respect to a base measure dx defined on the domain \mathcal{X}, we define the f-divergence

D_f(P \| Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx, \qquad (2)

where the generator function f : \mathbb{R}_+ \to \mathbb{R} is a convex, lower-semicontinuous function satisfying f(1) = 0.
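A minimal numeric sketch of definition (2) for discrete distributions (added for illustration; the helper names and toy distributions are mine, not the paper's), using the KL generator f(u) = u log u and the Pearson chi-squared generator f(u) = (u - 1)^2:

import numpy as np

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

f_kl = lambda u: u * np.log(u)          # Kullback-Leibler generator
f_pearson = lambda u: (u - 1.0) ** 2    # Pearson chi-squared generator

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, f_kl))         # equals sum_x p(x) log(p(x)/q(x))
print(f_divergence(p, q, f_pearson))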

Variational Estimation of f-divergence
Every convex, lower-semicontinuous function f has a convex conjugate function f^*, also known as the Fenchel conjugate. It is defined as

f^*(t) = \sup_{u \in \mathrm{dom}_f} \{ u t - f(u) \}. \qquad (3)

The function f^* is again convex and lower-semicontinuous, and the pair (f, f^*) is dual to one another in the sense that f^{**} = f. Therefore, we can also represent f as

f(u) = \sup_{t \in \mathrm{dom}_{f^*}} \{ t u - f^*(t) \}.
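As a worked example (added, not in the slides): for the KL generator f(u) = u \log u on \mathbb{R}_+,

f^*(t) = \sup_{u > 0} \{ u t - u \log u \} = e^{t-1},

with the supremum attained at u = e^{t-1}; conversely, \sup_t \{ t u - e^{t-1} \} = u \log u is attained at t = 1 + \log u, illustrating the duality f^{**} = f.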

Variational Estimation of f-divergence
Using the conjugate function, we can rewrite the f-divergence as

D_f(P \| Q) = \int_{\mathcal{X}} q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t \frac{p(x)}{q(x)} - f^*(t) \right\} dx
\geq \sup_{T \in \mathcal{T}} \left( \int_{\mathcal{X}} p(x)\, T(x)\, dx - \int_{\mathcal{X}} q(x)\, f^*(T(x))\, dx \right)
= \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \right), \qquad (4)

where \mathcal{T} is an arbitrary class of functions T : \mathcal{X} \to \mathbb{R}. The above derivation yields a lower bound for two reasons:
- Jensen's inequality when swapping the integration and supremum operations;
- the class of functions \mathcal{T} may contain only a subset of all possible functions.

Variational Estimation of f-divergence
By taking the variation of the lower bound in (4) w.r.t. T, we find that under mild conditions on f the bound is tight for

T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right). \qquad (5)

For example, the popular reverse Kullback-Leibler divergence corresponds to f(u) = -\log u, resulting in T^*(x) = -q(x)/p(x).
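A small numeric sanity check of (4) and (5), added for illustration with toy distributions. For the KL generator f(u) = u \log u we have f'(u) = 1 + \log u and f^*(t) = e^{t-1}; plugging T^*(x) = f'(p(x)/q(x)) into the right-hand side of (4) reproduces D_f(P \| Q) exactly, while any other T gives a smaller value:

import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution P
q = np.array([0.4, 0.4, 0.2])   # model distribution Q

f_star = lambda t: np.exp(t - 1.0)      # conjugate of f(u) = u log u
T_opt = 1.0 + np.log(p / q)             # T*(x) = f'(p(x)/q(x))

def lower_bound(T):
    # E_{x~P}[T(x)] - E_{x~Q}[f*(T(x))] for discrete P and Q
    return float(np.sum(p * T) - np.sum(q * f_star(T)))

exact_kl = float(np.sum(p * np.log(p / q)))
print(exact_kl, lower_bound(T_opt))     # equal up to floating-point error
print(lower_bound(T_opt - 0.3))         # strictly smaller: only a lower bound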

Variational Divergence Minimization
We follow the generative-adversarial approach and use two neural networks, Q and T:
- Q is the generative model (generator); we parametrize Q through a vector θ and write Q_θ.
- T is the variational function (discriminator), taking a sample as input and returning a scalar; we parametrize T using a vector ω and write T_ω.
We can learn a generative model Q_θ by finding a saddle point of the following f-GAN objective function, where we minimize with respect to θ and maximize with respect to ω:

F(θ, ω) = \mathbb{E}_{x \sim P}[T_ω(x)] - \mathbb{E}_{x \sim Q_θ}[f^*(T_ω(x))]. \qquad (6)
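A minimal PyTorch sketch of one alternating update on objective (6), added for illustration and not the authors' code. It is instantiated for the KL divergence, whose conjugate f^*(t) = e^{t-1} has unrestricted domain so T_ω needs no output constraint; the networks T and Q, their optimizers, and the data batches are assumed to be supplied by the caller.

import torch

def f_star_kl(t):
    # conjugate of f(u) = u log u (Kullback-Leibler divergence)
    return torch.exp(t - 1.0)

def fgan_step(x_real, z, T, Q, opt_T, opt_Q):
    """One alternating update of F(theta, omega) = E_P[T(x)] - E_Q[f*(T(x))]."""
    # discriminator / variational function: maximize F, i.e. minimize -F
    x_fake = Q(z).detach()
    loss_T = -(T(x_real).mean() - f_star_kl(T(x_fake)).mean())
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    # generator: minimize F w.r.t. theta (only the second term depends on theta)
    loss_Q = -f_star_kl(T(Q(z))).mean()
    opt_Q.zero_grad(); loss_Q.backward(); opt_Q.step()
    return loss_T.item(), loss_Q.item()

This follows (6) literally; the simplified saddle-point procedure mentioned in the contributions corresponds to alternating exactly one gradient step of each kind per iteration.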

Variational Divergence Minimization
To apply the variational objective (6) to different f-divergences, we need to respect the domain dom_{f^*} of the conjugate function f^*. The authors assume that the variational function T_ω is represented in the form T_ω(x) = g_f(V_ω(x)) and rewrite the saddle objective (6) as follows:

F(θ, ω) = \mathbb{E}_{x \sim P}[ g_f(V_ω(x)) ] + \mathbb{E}_{x \sim Q_θ}[ -f^*( g_f(V_ω(x)) ) ], \qquad (7)

where V_ω : \mathcal{X} \to \mathbb{R} is without any range constraints on its output, and g_f : \mathbb{R} \to \mathrm{dom}_{f^*} is an output activation function specific to the f-divergence used.
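For concreteness, a small sketch (added) of how (7) can be instantiated. The (g_f, f^*) pairs below are derived to satisfy the stated domain requirements rather than copied from the paper's table: KL and Pearson χ² have dom f^* = \mathbb{R}, so the identity activation suffices, while the GAN divergence (taking it to mean the generator f(u) = u\log u - (u+1)\log(u+1)) needs an activation with strictly negative outputs.

import math

# divergence name -> (output activation g_f, conjugate f*)
activations = {
    "KL":      (lambda v: v,                         lambda t: math.exp(t - 1.0)),
    "Pearson": (lambda v: v,                         lambda t: 0.25 * t * t + t),
    "GAN":     (lambda v: -math.log1p(math.exp(-v)), lambda t: -math.log(-math.expm1(t))),
}

def objective_estimate(xs_p, xs_q, V, name):
    # Monte-Carlo estimate of (7): mean_P[g_f(V(x))] + mean_Q[-f*(g_f(V(x)))]
    g, f_star = activations[name]
    term_p = sum(g(V(x)) for x in xs_p) / len(xs_p)
    term_q = sum(-f_star(g(V(x))) for x in xs_q) / len(xs_q)
    return term_p + term_q

For the GAN entry, g_f(v) = -\log(1 + e^{-v}) maps any real v into the negative reals, and one can check that the resulting objective reduces to the original GAN criterion written with a sigmoid discriminator.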

[Table slide (list of f-divergences and their corresponding functions); contents not captured in the transcription.]

Experiments
MNIST digits:

Training divergence        KDE LL (nats) ± SEM
Kullback-Leibler            416 ± 5.62
Reverse Kullback-Leibler    319 ± 8.36
Pearson χ²                  429 ± 5.53
Neyman χ²                   300 ± 8.33
Squared Hellinger          -708 ± 18.1
Jeffrey                   -2101 ± 29.9
Jensen-Shannon              367 ± 8.19
GAN                         305 ± 8.97
Variational Autoencoder     445 ± 5.36
KDE MNIST train (60k)       502 ± 5.99

2 f-GANs in an Information Geometric Nutshell

We define two important classes of distortion measures.

Definition. For any two distributions P and Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the f-divergence between P and Q, where f : \mathbb{R}_+ \to \mathbb{R} is convex with f(1) = 0, is

I_f(P \| Q) = \mathbb{E}_{X \sim Q}\!\left[ f\!\left( \frac{P(X)}{Q(X)} \right) \right] = \int_{\mathcal{X}} Q(x)\, f\!\left( \frac{P(x)}{Q(x)} \right) d\mu(x). \qquad (8)

For any convex differentiable ϕ : \mathbb{R}^d \to \mathbb{R}, the (ϕ-)Bregman divergence between θ and ϱ is

D_ϕ(θ \| ϱ) = ϕ(θ) - ϕ(ϱ) - (θ - ϱ)^{\top} \nabla ϕ(ϱ), \qquad (9)

where ϕ is called the generator of the Bregman divergence.
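Two standard instances of (9), added for illustration:

ϕ(θ) = \tfrac{1}{2}\|θ\|^2 \;\Rightarrow\; D_ϕ(θ \| ϱ) = \tfrac{1}{2}\|θ - ϱ\|^2,
\qquad
ϕ(θ) = \sum_i θ_i \log θ_i \;\Rightarrow\; D_ϕ(θ \| ϱ) = \sum_i θ_i \log\frac{θ_i}{ϱ_i} - \sum_i (θ_i - ϱ_i),

the latter reducing to the KL divergence when θ and ϱ lie on the probability simplex.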

The contributions of this paper:
- They show that

f\text{-GAN}(P, \mathrm{escort}(Q)) = D(θ \| ϑ) + \mathrm{Penalty}(Q)

for P and Q (with respective parameters θ and ϑ) which happen to lie in a superset of exponential families called deformed exponential families.
- They develop a novel min-max game interpretation of the equation above.

χ-logarithms define generalizations of exponential families. Let χ : \mathbb{R}_+ \to \mathbb{R}_+ be non-decreasing. The χ-logarithm, \log_χ, is defined as

\log_χ(z) = \int_1^z \frac{1}{χ(t)}\, dt. \qquad (10)

The χ-exponential is

\exp_χ(z) = 1 + \int_0^z λ(t)\, dt, \qquad (11)

where λ is defined by λ(\log_χ(z)) = χ(z).
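Two sanity checks, added for illustration: for χ(t) = t we get \log_χ(z) = \int_1^z dt/t = \log z, λ(t) = e^t and \exp_χ(z) = 1 + \int_0^z e^t\, dt = e^z, i.e. the usual logarithm and exponential; for χ(t) = t^q with q ≠ 1 we get \log_χ(z) = (z^{1-q} - 1)/(1 - q), the Tsallis q-logarithm.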

Deformed exponential family
Definition. A distribution P from a χ-exponential family (or deformed exponential family, χ being implicit) with convex cumulant C : Θ \to \mathbb{R} and sufficient statistics φ : \mathcal{X} \to \mathbb{R}^d has density given by

P_{χ,C}(x \mid θ, φ) = \exp_χ( φ(x)^{\top} θ - C(θ) ), \qquad (12)

with respect to a dominating measure µ. Here, Θ is a convex open set and θ is called the coordinate of P. The escort density (or χ-escort) of P_{χ,C} is

\tilde{P}_{χ,C} = \frac{1}{Z}\, χ(P_{χ,C}), \qquad (13)

where

Z = \int_{\mathcal{X}} χ(P_{χ,C}(x \mid θ, φ))\, d\mu(x) \qquad (14)

is the escort's normalization constant.
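For intuition (added): when χ(t) = t we have Z = \int P\, d\mu = 1 and the escort is P itself; when χ(t) = t^q the escort is the familiar power escort \tilde{P}(x) = P(x)^q / \int P(y)^q\, d\mu(y) from nonextensive (Tsallis) statistics.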

Theorem. For any two χ-exponential distributions P and Q with respective densities P_{χ,C}, Q_{χ,C} and coordinates θ, ϑ,

D_C(θ \| ϑ) = \mathbb{E}_{X \sim \tilde{Q}}[ \log_χ(Q_{χ,C}(X)) - \log_χ(P_{χ,C}(X)) ], \qquad (15)

where \tilde{Q} denotes the χ-escort of Q.

Definition. For any χ-logarithm and distributions P, Q having respective densities P and Q absolutely continuous with respect to a base measure µ, the KL_χ divergence between P and Q is defined as

\mathrm{KL}_χ(P \| Q) = \mathbb{E}_{X \sim P}\!\left[ -\log_χ\!\left( \frac{Q(X)}{P(X)} \right) \right]. \qquad (16)

Since χ is non-decreasing, -\log_χ is convex, and so any KL_χ divergence is an f-divergence.
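Sanity check (added): for χ(z) = z the family (12) is the ordinary exponential family, the escort is Q itself, and (15)–(16) reduce to the classical identity

\mathrm{KL}(Q \| P) = \mathbb{E}_{X \sim Q}[ \log Q(X) - \log P(X) ] = D_C(θ \| ϑ),

i.e. the KL divergence between two exponential-family members equals the Bregman divergence generated by the cumulant on their natural parameters (with the arguments swapped).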

Theorem. Writing P = P_{χ,C} and Q = Q_{χ,C} for short as in the last theorem, we have

\mathbb{E}_{X \sim \tilde{Q}}[ \log_χ(Q_{χ,C}(X)) - \log_χ(P_{χ,C}(X)) ] = \mathrm{KL}_{χ_Q}(\tilde{Q} \| P) - J(Q), \quad \text{with } J(Q) = \mathrm{KL}_{χ_Q}(\tilde{Q} \| Q).

Theorem. \mathrm{KL}_{χ_Q}(\tilde{Q} \| P) admits the variational formulation

\mathrm{KL}_{χ_Q}(\tilde{Q} \| P) = \sup_{T : \mathcal{X} \to \mathbb{R}_{--}} \left\{ \mathbb{E}_{X \sim P}[T(X)] - \mathbb{E}_{X \sim Q}\big[ (-\log_{χ_Q})^{\star}(T(X)) \big] \right\}, \qquad (17)

with \mathbb{R}_{--} = \mathbb{R} \setminus \mathbb{R}_{++}. Furthermore, letting Z denote the normalization constant of the χ-escort of Q, the optimum T^* : \mathcal{X} \to \mathbb{R}_{--} of eq. (17) is

T^*(x) = -\frac{1}{Z} \cdot \frac{χ(Q(x))}{χ(P(x))}. \qquad (18)

Corollary. From the theorems above, we have

\sup_{T : \mathcal{X} \to \mathbb{R}_{--}} \left\{ \mathbb{E}_{X \sim P}[T(X)] - \mathbb{E}_{X \sim Q}\big[ (-\log_{χ_Q})^{\star}(T(X)) \big] \right\} = D_C(θ \| ϑ) + J(Q), \qquad (19)

where θ (resp. ϑ) is the coordinate of P (resp. Q). This is the formal version of the identity f-GAN(P, escort(Q)) = D(θ \| ϑ) + Penalty(Q) announced in the contributions, with the penalty given by J(Q).


Experiments: MNIST
[Slide contents not captured in the transcription.]