Energy-Based Generative Adversarial Network


Energy-Based Generative Adversarial Network
J. Zhao, M. Mathieu and Y. LeCun

Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning
D. Wang and Q. Liu

Presented by Chunyuan Li

Preliminaries: vanilla GAN

1. Real data distribution $P_r(x)$, where $x \in \mathcal{X}$;
2. Fake data distribution $P_g(x)$, implemented via a generator as $x = G(z)$, $z \sim P(z)$, where $z \in \mathcal{Z}$.

Objective: $\min_G \max_D V(D, G)$

Discriminator $D : \mathcal{X} \to [0, 1]$
$$L_d = -\mathbb{E}_{x \sim P_r}[\log D(x)] - \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \qquad (1)$$
$D(x)$: the probability that $x$ comes from the real data rather than the generator.

Generator $G : \mathcal{Z} \to \mathcal{X}$
$$L_g = \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] = \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))] \qquad (2)$$
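For concreteness, the two losses in Eqs. (1)-(2) can be written as a short sketch. This is a minimal PyTorch illustration, assuming hypothetical D and G modules and a noise batch z; it is not code from either paper.

```python
import torch

def vanilla_gan_losses(D, G, x_real, z):
    """Eqs. (1)-(2): the minimax objective written as two losses to minimize."""
    eps = 1e-8                           # numerical safety inside the logs
    d_real = D(x_real)                   # D(x) in [0, 1] for real samples
    d_fake = D(G(z))                     # D(G(z)) for generated samples
    # L_d = -E[log D(x)] - E[log(1 - D(G(z)))]
    L_d = -(torch.log(d_real + eps).mean() + torch.log(1 - d_fake + eps).mean())
    # L_g = E[log(1 - D(G(z)))]
    L_g = torch.log(1 - d_fake + eps).mean()
    return L_d, L_g
```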

Ideas of the two papers

1. EBGAN
   - An energy-based formulation for adversarial training
   - An autoencoder is employed as the energy function
   - A repelling force (regularizer) encourages sample diversity
2. SteinGAN
   - Amortized SVGD algorithm
   - A probabilistic version of EBGAN

EBGAN

Discriminator $D : \mathcal{X} \to \mathbb{R}$:
$$L_d = \mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{x \sim P_g}[\max(0, m - D(x))] \qquad (3)$$
where $m$ is a positive margin.

Generator $G : \mathcal{Z} \to \mathcal{X}$:
$$L_g = \mathbb{E}_{x \sim P_g}[D(x)] = \mathbb{E}_{z \sim P(z)}[D(G(z))] \qquad (4)$$
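A minimal sketch of the margin losses in Eqs. (3)-(4), assuming D outputs a scalar energy per sample; the networks, batch shapes, and the value of the margin m are illustrative assumptions.

```python
import torch

def ebgan_losses(D, G, x_real, z, m=10.0):
    e_real = D(x_real)            # energy assigned to real samples
    e_fake = D(G(z))              # energy assigned to generated samples
    # Eq. (3): push real energy down; push fake energy up, but only while it
    # is still below the margin m (the hinge term max(0, m - D(x))).
    L_d = e_real.mean() + torch.clamp(m - e_fake, min=0).mean()
    # Eq. (4): the generator tries to lower the energy of its own samples.
    L_g = e_fake.mean()
    return L_d, L_g
```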

EBGAN: autoencoder as the energy function

Discriminator:
$$D(x) = \|Dec(Enc(x)) - x\| \qquad (5)$$
$$L_d = \mathbb{E}_{x \sim P_r}[\|Dec(Enc(x)) - x\|] + \mathbb{E}_{x \sim P_g}[\max(0, m - \|Dec(Enc(x)) - x\|)] \qquad (6)$$

Generator:
$$L_g = \mathbb{E}_{z \sim P(z)}[\|Dec(Enc(G(z))) - G(z)\|] \qquad (7)$$

Figure: EBGAN architecture.
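The autoencoder energy in Eq. (5) is simply a per-sample reconstruction error. A small sketch, assuming hypothetical Enc and Dec modules and image-shaped inputs:

```python
import torch

def autoencoder_energy(Enc, Dec, x):
    """Eq. (5): D(x) = ||Dec(Enc(x)) - x||, one scalar energy per sample."""
    recon = Dec(Enc(x))
    return (recon - x).flatten(start_dim=1).norm(dim=1)
```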

Repelling regularizer ("pulling-away" term)

$$L_{PT}(S) = \frac{1}{bs(bs - 1)} \sum_i \sum_{j \neq i} \left( \frac{S_i^\top S_j}{\|S_i\| \, \|S_j\|} \right)^2 \qquad (8)$$

- $bs$: batch size
- $S_i$: $i$-th sample of minibatch $S$ at the feature level $Enc(G(z))$

Decreasing the pairwise cosine similarity between sample representations within the minibatch makes them as mutually orthogonal as possible, which increases the diversity of the generated samples.

Regularized generator loss:
$$L_g' = L_g + \lambda L_{PT}, \qquad \lambda = 0.1 \qquad (9)$$
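A minimal sketch of Eq. (8), assuming S is the (bs, feature_dim) matrix of encoder features Enc(G(z)) for one minibatch:

```python
import torch

def pulling_away_term(S):
    """L_PT: mean squared cosine similarity over all ordered pairs i != j."""
    bs = S.size(0)
    S_norm = S / (S.norm(dim=1, keepdim=True) + 1e-8)   # unit-normalize each row
    cos = S_norm @ S_norm.t()                            # (bs, bs) cosine similarities
    mask = 1.0 - torch.eye(bs, device=S.device)          # drop the diagonal (i == j)
    return (cos * mask).pow(2).sum() / (bs * (bs - 1))   # Eq. (8)
```

The regularized loss of Eq. (9) would then be L_g + 0.1 * pulling_away_term(Enc(G(z))).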

SteinGAN: SVGD for $P$

Idea: approximate the target distribution $P$ with a set of particles.

SVGD updates the particles iteratively, $x_i \leftarrow x_i + \epsilon \, \phi(x_i)$, where $\phi(x)$ is a particle gradient direction chosen to maximally decrease the KL divergence:
$$\phi^* = \arg\max_{\phi \in \mathcal{F}} \left\{ -\frac{d}{d\epsilon} KL\big(P_{g[\epsilon\phi]} \,\|\, P\big) \Big|_{\epsilon = 0} \right\} \qquad (10)$$

- $P_{g[\epsilon\phi]}$ denotes the approximate density of the updated particles
- $\mathcal{F}$ is the set of perturbation directions that we optimize over

When $\mathcal{F}$ is an RKHS, there exists a closed-form solution, approximated with $n$ particles:
$$\phi^*(x_i) = \Delta x_i = \frac{1}{n} \sum_{j=1}^{n} \left[ \nabla_{x_j} \log p(x_j) \, k(x_j, x_i) + \nabla_{x_j} k(x_j, x_i) \right] \qquad (11)$$

- 1st term: drives the particles toward the high-probability regions of the target $P$
- 2nd term: a repulsive force that encourages diversity
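A minimal NumPy sketch of one SVGD update with an RBF kernel, Eq. (11). The median-heuristic bandwidth and the grad_log_p callback (returning the score ∇ log p at each particle) are assumptions, following common SVGD practice rather than either paper's exact code.

```python
import numpy as np

def svgd_step(x, grad_log_p, eps=0.1):
    """One update x_i <- x_i + eps * phi*(x_i) for particles x of shape (n, d)."""
    n = x.shape[0]
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)   # (n, n)
    h = np.median(sq_dist) / np.log(n + 1) + 1e-8      # bandwidth (median heuristic)
    K = np.exp(-sq_dist / h)                           # k(x_j, x_i), symmetric
    score = grad_log_p(x)                              # (n, d), score at each particle
    # Driving term: kernel-weighted average of the scores (toward high density).
    drive = K @ score
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) pushes particles apart.
    repulse = (x * K.sum(axis=1, keepdims=True) - K @ x) * (2.0 / h)
    phi = (drive + repulse) / n
    return x + eps * phi
```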

Amortized SVGD

Disadvantage of SVGD: generation/inference cannot improve based on experience from past tasks.

Solution: train an $\eta$-parameterized neural network $f(\eta; z)$, $z \sim \mathcal{N}(0, I)$, to mimic the SVGD dynamics. An incremental approach in which $\eta$ is iteratively adjusted so that the change of the network outputs fits the change prescribed by SVGD:
$$\eta^{t+1} \leftarrow \arg\min_{\eta} \sum_{i=1}^{m} \left\| f(\eta; z_i) - f(\eta^t; z_i) - \epsilon \, \Delta x_i \right\|^2 \qquad (12)$$

Combined with the Taylor expansion $f(\eta^{t+1}; z_i) \approx f(\eta^t; z_i) + \partial_\eta f(\eta^t; z_i)(\eta^{t+1} - \eta^t)$, this yields the update
$$\eta^{t+1} \leftarrow \eta^t + \epsilon \sum_{i=1}^{m} \partial_\eta f(\eta^t; z_i) \, \Delta x_i \qquad (13)$$

This is a chain rule that back-propagates the Stein variational gradient to the network parameter $\eta$.
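A rough sketch of one amortized update in the spirit of Eq. (12), implemented as a single least-squares gradient step in PyTorch; the generator f, its optimizer, and the svgd_direction callback (returning Δx for a batch of particles) are assumptions.

```python
import torch

def amortized_svgd_update(f, optimizer, z, svgd_direction, eps=1e-3):
    """Move eta so that the change in f(eta; z_i) matches eps * Delta x_i."""
    with torch.no_grad():
        x = f(z)                               # current particles x_i = f(eta_t; z_i)
        target = x + eps * svgd_direction(x)   # where one SVGD step would move them
    loss = ((f(z) - target) ** 2).sum()        # least-squares objective of Eq. (12)
    optimizer.zero_grad()
    loss.backward()                            # back-propagates the Stein gradient to eta
    optimizer.step()
    return loss.item()
```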

Maximum likelihood with energy functions

Denote the empirical data distribution by $P_r$, and build an $\omega$-parameterized model $p(x|\omega)$ with energy function $f(x, \omega)$:
$$p(x|\omega) = \exp\{-f(x, \omega) - F(\omega)\}, \qquad F(\omega) = \log \int_x \exp\{-f(x, \omega)\} \, dx \qquad (14)$$
where $p_\omega(x)$ denotes the marginal of $x$ under our model, $p_\omega(x) \propto \exp\{-f(x, \omega)\}$.

The MLE over the entire dataset is
$$\max_\omega \left\{ L(\omega; x) = \mathbb{E}_{x \sim P_r}[\log p(x|\omega)] \right\} \qquad (15)$$

Gradient (using $\nabla_\omega F(\omega) = -\mathbb{E}_{p_\omega(x)}[\nabla_\omega f(x, \omega)]$):
$$\nabla_\omega L = -\mathbb{E}_{P_r}[\nabla_\omega f(x, \omega)] + \mathbb{E}_{p_\omega(x)}[\nabla_\omega f(x, \omega)] \qquad (16)$$

The expectation in the second term is intractable, so $P_g$ is used as its approximation.
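In practice, replacing the intractable model expectation in Eq. (16) with generator samples gives a simple contrastive loss: minimizing it lowers the energy on real data and raises it on generated data. A minimal sketch, where f_energy is a hypothetical energy network and x_fake are samples from P_g:

```python
import torch

def energy_mle_loss(f_energy, x_real, x_fake):
    """Approximation of -L from Eq. (16): E_Pr[f(x)] - E_Pg[f(x)]."""
    return f_energy(x_real).mean() - f_energy(x_fake).mean()
```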

Amortized SVGD for GAN

Reformulate SteinGAN with the notation of EBGAN:
$$D(x) = f(\omega, x) \quad \text{and} \quad G(z) = f(\eta; z) \qquad (17)$$

1. Discriminator (update $\omega$ with SGD):
$$L_d = \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{x \sim P_g}[D(x)] \qquad (18)$$
Decreases the energy of the true data and increases the energy of the simulated data.

2. Generator (update $\eta$ with amortized SVGD):
$$L_g = \mathbb{E}_{x \sim P_g}[D(x)] = \mathbb{E}_{z \sim P(z)}[D(G(z))] \qquad (19)$$
Decreases the energy of the simulated data to better fit $p(x|\omega)$.

3. Specification of $D(x)$ and $k(x, x')$:
$$D(x) = \|Dec(Enc(x)) - x\| \qquad (20)$$
$$k(x, x') = \exp\left(-\frac{1}{h^2} \|Enc(x) - Enc(x')\|^2\right), \quad h \text{ is the bandwidth} \qquad (21)$$
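The kernel of Eq. (21) acts on encoder features rather than raw pixels. A minimal sketch, assuming a hypothetical Enc module and a fixed bandwidth h (a median heuristic over the batch is a common alternative):

```python
import torch

def encoder_rbf_kernel(Enc, x, x_prime, h=1.0):
    """Eq. (21): k(x, x') = exp(-||Enc(x) - Enc(x')||^2 / h^2), per pair in a batch."""
    d2 = (Enc(x) - Enc(x_prime)).flatten(start_dim=1).pow(2).sum(dim=1)
    return torch.exp(-d2 / h ** 2)
```

With this kernel, the SVGD direction from Eq. (11) is computed on generated samples and fed into the amortized update of Eq. (12), while the discriminator is trained on the contrastive loss of Eq. (18).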

Inception Score (IS): Evaluation Metric

A pretrained Inception model $p(y|x)$ is applied to every generated image.

- C1: for higher quality of a single image, the conditional $p(y|x)$ should have low entropy.
- C2: for higher diversity across all images, the marginal $\hat{p}(y) = \mathbb{E}_{z \sim P(z)}\, p(y \mid x = G(z))$ should have high entropy.

Combining C1 and C2, the IS is
$$\exp\left(\mathbb{E}_{P_g}\big[ KL(p(y|x) \,\|\, \hat{p}(y)) \big]\right) \qquad (22)$$
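A minimal NumPy sketch of Eq. (22), assuming `probs` is an (N, num_classes) array of softmax outputs p(y|x) from the pretrained Inception model for N generated images:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(E_x[ KL(p(y|x) || p_hat(y)) ])."""
    p_y = probs.mean(axis=0, keepdims=True)     # marginal p_hat(y) over generated images
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)  # KL per image
    return float(np.exp(kl.mean()))
```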

Results

1. EBGAN
   - MNIST dataset
   - Tested various architectures and optimization settings (512 experiments)
2. SteinGAN
   - CIFAR-10 dataset
   - Inception Scores:

     Real training set  9.848
     DCGAN              7.368
     SteinGAN           7.428