Generative Adversarial Networks (GANs)
Ian Goodfellow, OpenAI Research Scientist
Presentation at Berkeley Artificial Intelligence Lab, 2016-08-31

Generative Modeling
- Density estimation
- Sample generation
(Figure: training examples vs. model samples.)

Maximum Likelihood
\theta^* = \arg\max_\theta \, \mathbb{E}_{x \sim p_{\text{data}}} \log p_{\text{model}}(x; \theta)
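A minimal sketch of this principle (my own toy example, not from the talk): fitting a 1-D Gaussian by maximizing the average log-likelihood of the training sample. The names `data` and `avg_log_likelihood` are illustrative.

```python
# Maximum likelihood for a 1-D Gaussian: theta* = argmax_theta E_{x~p_data} log p_model(x; theta)
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # stand-in for samples from p_data

def avg_log_likelihood(mu, sigma, x):
    # Monte Carlo estimate of E_{x~p_data} log p_model(x; mu, sigma)
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# For a Gaussian the maximizer is available in closed form: sample mean and (biased) std.
mu_hat, sigma_hat = data.mean(), data.std()
print(mu_hat, sigma_hat, avg_log_likelihood(mu_hat, sigma_hat, data))
```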

Taxonomy of Generative Models (rooted at maximum likelihood)
- Explicit density
  - Tractable density: fully visible belief nets (NADE, MADE, PixelRNN), change of variables models (nonlinear ICA)
  - Approximate density: variational (variational autoencoder), Markov chain (Boltzmann machine)
- Implicit density
  - Direct: GAN
  - Markov chain: GSN

Fully Visible Belief Nets (Frey et al 1996)
Explicit formula based on the chain rule:
p_{\text{model}}(x) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})
Disadvantages:
- O(n) sample generation cost
- Currently do not learn a useful latent representation
(Figure: PixelCNN elephants, van den Oord et al 2016.)
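A minimal sketch of the chain-rule factorization (a hypothetical toy conditional, not the slide's PixelCNN): the log-density is a sum of per-dimension conditional log-probabilities, and sampling would require n sequential draws, which is the O(n) cost mentioned above.

```python
# Evaluate log p(x) = log p(x_1) + sum_i log p(x_i | x_<i) for a toy binary autoregressive model.
import numpy as np

def toy_conditional(x_prev):
    """p(x_i = 1 | x_1..x_{i-1}): here just a logistic function of the running mean (made up)."""
    if len(x_prev) == 0:
        return 0.5
    return 1.0 / (1.0 + np.exp(-(np.mean(x_prev) - 0.5)))

def log_prob(x):
    logp = 0.0
    for i in range(len(x)):
        p1 = toy_conditional(x[:i])
        logp += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return logp

print(log_prob(np.array([1, 0, 1, 1])))
```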

Change of Variables
y = g(x) \;\Rightarrow\; p_x(x) = p_y(g(x)) \left| \det \frac{\partial g(x)}{\partial x} \right|
e.g. nonlinear ICA (Hyvärinen 1999)
Disadvantages:
- Transformation must be invertible
- Latent dimension must match visible dimension
(Figure: 64x64 ImageNet samples, Real NVP, Dinh et al 2016.)
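A minimal sketch of the formula (my own toy affine map, not Real NVP): an invertible linear g with a standard-normal base density p_y, so the Jacobian determinant is constant.

```python
# Change of variables: p_x(x) = p_y(g(x)) * |det dg/dx| for g(x) = A x + b, p_y standard normal.
import numpy as np

A = np.array([[2.0, 0.3], [0.0, 1.5]])   # must be invertible
b = np.array([1.0, -0.5])

def g(x):
    return A @ x + b

def standard_normal_pdf(y):
    return np.exp(-0.5 * y @ y) / (2 * np.pi) ** (len(y) / 2)

def p_x(x):
    return standard_normal_pdf(g(x)) * abs(np.linalg.det(A))   # Jacobian of g is constant here

print(p_x(np.array([0.2, -0.1])))
```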

Variational Autoencoder (Kingma and Welling 2013, Rezende et al 2014)
\log p(x) \ge \log p(x) - D_{\mathrm{KL}}\!\left(q(z) \,\|\, p(z \mid x)\right) = \mathbb{E}_{z \sim q} \log p(x, z) + H(q)
Disadvantages:
- Not asymptotically consistent unless q is perfect
- Samples tend to have lower quality
(Figure: CIFAR-10 samples, Kingma et al 2016.)
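A minimal sketch of the bound (toy Gaussian encoder/decoder with made-up shapes; not a full VAE): a one-sample Monte Carlo estimate of E_{z~q}[log p(x, z)] + H(q), using the reparameterization trick for z.

```python
import torch

x = torch.randn(4)                                   # a fake data point
mu, log_sigma = torch.zeros(2), torch.zeros(2)       # q(z|x) = N(mu, sigma^2), from an "encoder"

sigma = log_sigma.exp()
z = mu + sigma * torch.randn(2)                      # reparameterized sample z ~ q

log_p_z = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum()           # prior term of log p(x, z)
x_mean = torch.tanh(torch.randn(4, 2) @ z)                                  # toy "decoder" mean
log_p_x_given_z = torch.distributions.Normal(x_mean, 1.0).log_prob(x).sum() # likelihood term
entropy_q = torch.distributions.Normal(mu, sigma).entropy().sum()           # H(q)

elbo = log_p_x_given_z + log_p_z + entropy_q         # lower bound on log p(x)
print(elbo.item())
```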

Boltzmann Machines
p(x) = \frac{1}{Z} \exp\!\left(-E(x, z)\right), \quad Z = \sum_x \sum_z \exp\!\left(-E(x, z)\right)
- Partition function is intractable
- May be estimated with Markov chain methods
- Generating samples requires Markov chains too
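A minimal sketch of why Z is intractable (my own tiny binary model; W and energy are made up): the partition function is a sum over every joint configuration, so the number of terms grows as 2^(n_x + n_z).

```python
# Brute-force Z = sum_x sum_z exp(-E(x, z)) for a tiny binary energy-based model.
import itertools
import numpy as np

n_x, n_z = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(n_x, n_z))                      # couplings between visible x and latent z

def energy(x, z):
    return -np.asarray(x) @ W @ np.asarray(z)

Z = sum(np.exp(-energy(x, z))
        for x in itertools.product([0, 1], repeat=n_x)
        for z in itertools.product([0, 1], repeat=n_z))
print(Z)                                             # 2**(n_x + n_z) = 32 terms here; 2**1000000+ for real models
```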

GANs
- Use a latent code
- Asymptotically consistent (unlike variational methods)
- No Markov chains needed
- Often regarded as producing the best samples, though there is no good way to quantify this

Generator Network
x = G(z; \theta^{(G)})
- Must be differentiable
- In theory, could use REINFORCE for discrete variables
- No invertibility requirement
- Trainable for any size of z
- Some guarantees require z to have higher dimension than x
- Can make x conditionally Gaussian given z, but need not do so
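A minimal sketch of such a generator (hypothetical architecture and sizes, not the one used in the talk): a differentiable map from latent codes z to samples x.

```python
import torch
import torch.nn as nn

# x = G(z; theta^(G)); z can have any size, here 100 -> 784 (e.g. a flattened 28x28 image)
G = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),      # outputs in [-1, 1]
)

z = torch.randn(64, 100)                 # a minibatch of latent codes
x_fake = G(z)                            # differentiable w.r.t. both z and theta^(G)
print(x_fake.shape)                      # torch.Size([64, 784])
```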

Training Procedure
Use an SGD-like algorithm of choice (Adam) on two minibatches simultaneously:
- A minibatch of training examples
- A minibatch of generated samples
Optional: run k steps of one player for every step of the other player.
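A minimal sketch of this loop (my own illustration with hypothetical toy networks and stand-in data; the talk gives no code): Adam updates on a real minibatch and a generated minibatch each step, with k discriminator steps per generator step, using the non-saturating generator loss described later.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))   # outputs logits

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()
k = 1                                                  # discriminator steps per generator step

def sample_data(batch_size=64):
    return torch.randn(batch_size, 784)                # stand-in for real training examples

for step in range(1000):
    for _ in range(k):
        x_real = sample_data()
        x_fake = G(torch.randn(64, 100)).detach()      # don't backprop into G on D's step
        d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    x_fake = G(torch.randn(64, 100))
    g_loss = bce(D(x_fake), torch.ones(64, 1))         # non-saturating generator loss: maximize log D(G(z))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```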

Minimax Game
J^{(D)} = -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_z \log\!\left(1 - D(G(z))\right)
J^{(G)} = -J^{(D)}
- Equilibrium is a saddle point of the discriminator loss
- Resembles Jensen-Shannon divergence
- Generator minimizes the log-probability of the discriminator being correct

Non-Saturating Game
J^{(D)} = -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_z \log\!\left(1 - D(G(z))\right)
J^{(G)} = -\frac{1}{2} \mathbb{E}_z \log D(G(z))
- Equilibrium no longer describable with a single loss
- Generator maximizes the log-probability of the discriminator being mistaken
- Heuristically motivated; the generator can still learn even when the discriminator successfully rejects all generator samples

Maximum Likelihood Game
J^{(D)} = -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_z \log\!\left(1 - D(G(z))\right)
J^{(G)} = -\frac{1}{2} \mathbb{E}_z \exp\!\left(\sigma^{-1}(D(G(z)))\right)
When the discriminator is optimal, the generator gradient matches that of maximum likelihood ("On Distinguishability Criteria for Estimating Generative Models", Goodfellow 2014, p. 5).
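A minimal sketch comparing the three generator costs above as functions of D(G(z)) (my own tabulation, in the spirit of the "Comparison of Generator Losses" figure below; sigma^{-1} is the logit, so exp(sigma^{-1}(D)) = D / (1 - D)).

```python
import numpy as np

D = np.linspace(1e-3, 1 - 1e-3, 5)          # discriminator output on generated samples

minimax        =  0.5 * np.log(1.0 - D)     # J^(G) = -J^(D): generator minimizes log(1 - D)
non_saturating = -0.5 * np.log(D)           # generator maximizes log D
max_likelihood = -0.5 * (D / (1.0 - D))     # -(1/2) exp(sigma^{-1}(D))

for row in zip(D, minimax, non_saturating, max_likelihood):
    print("D=%.3f  minimax=%+.3f  non-sat=%+.3f  ML=%+.3f" % row)
```

Note how the minimax cost saturates (gradient vanishes) when D is near 0, i.e. when the discriminator confidently rejects generator samples, while the non-saturating cost keeps a strong gradient there.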

Maximum Likelihood Samples

Discriminator Strategy
The optimal D(x) for any p_{\text{data}}(x) and p_{\text{model}}(x) is always
D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}
A cooperative rather than adversarial view of GANs: the discriminator tries to estimate the ratio of the data and model distributions, and informs the generator of its estimate in order to guide its improvements.
(Figure: discriminator output, data distribution, and model distribution over x, with x = G(z).)
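A minimal sketch of this ratio (my own toy 1-D densities): with p_data and p_model known in closed form, the optimal discriminator is near 1 where the data density dominates and near 0 where the model density dominates.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def p_data(x):   return normal_pdf(x, 0.0, 1.0)
def p_model(x):  return normal_pdf(x, 1.5, 1.2)

def d_star(x):
    # D*(x) = p_data(x) / (p_data(x) + p_model(x)): a density-ratio estimate
    return p_data(x) / (p_data(x) + p_model(x))

for x in [-2.0, 0.0, 1.5, 3.0]:
    print(x, round(float(d_star(x)), 3))
```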

Comparison of Generator Losses

DCGAN Architecture. Most "deconv" (transposed convolution) layers are batch normalized (Radford et al 2015).

DCGANs for LSUN Bedrooms (Radford et al 2015)

Vector Space Arithmetic
Man with glasses - Man + Woman = Woman with glasses

Mode Collapse
- Fully optimizing the discriminator with the generator held constant is safe
- Fully optimizing the generator with the discriminator held constant results in mapping all points to the argmax of the discriminator
- Can partially fix this by adding nearest-neighbor features constructed from the current minibatch to the discriminator ("minibatch GAN") (Salimans et al 2016); see the sketch below
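A minimal sketch of the minibatch-features idea (a simplified version of Salimans et al 2016, not their exact formulation; names are mine): each example gets an extra feature summarizing how close it is to the other examples in its minibatch, so the discriminator can detect collapsed, low-diversity batches.

```python
import torch

def minibatch_features(h):
    """h: (batch, features) hidden activations. Returns a (batch, 1) closeness score."""
    dists = torch.cdist(h, h, p=1)                        # pairwise L1 distances within the minibatch
    dists = dists + torch.eye(h.shape[0]) * 1e6           # exclude self-distances
    return torch.exp(-dists).sum(dim=1, keepdim=True)     # large when nearest neighbors are very close

h = torch.randn(64, 128)                                  # hypothetical discriminator features
augmented = torch.cat([h, minibatch_features(h)], dim=1)  # feed this to the next discriminator layer
print(augmented.shape)                                    # torch.Size([64, 129])
```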

Minibatch GAN on CIFAR (Salimans et al 2016). (Figure: training data vs. samples.)

Minibatch GAN on ImageNet (Salimans et al 2016)

Cherry-Picked Results

GANs Work Best When Output Entropy Is Low
(Figure: examples of images generated from text descriptions such as "this small bird has a pink breast and crown, and black primaries and secondaries"; Reed et al 2016.)

Optimization and Games
Optimization: find a minimum: \theta^* = \arg\min_\theta J(\theta)
Game:
- Player 1 controls \theta^{(1)}
- Player 2 controls \theta^{(2)}
- Player 1 wants to minimize J^{(1)}(\theta^{(1)}, \theta^{(2)})
- Player 2 wants to minimize J^{(2)}(\theta^{(1)}, \theta^{(2)})
Depending on the J functions, the players may compete or cooperate.

Games ⊇ optimization
Example: let \theta^{(2)} = \{\} (the second player controls nothing) and set
J^{(1)}(\theta^{(1)}, \theta^{(2)}) = J(\theta^{(1)}), \quad J^{(2)}(\theta^{(1)}, \theta^{(2)}) = 0

Nash Equilibrium
No player can reduce their cost by changing their own strategy:
\forall \theta^{(1)}: \; J^{(1)}(\theta^{(1)*}, \theta^{(2)*}) \le J^{(1)}(\theta^{(1)}, \theta^{(2)*})
\forall \theta^{(2)}: \; J^{(2)}(\theta^{(1)*}, \theta^{(2)*}) \le J^{(2)}(\theta^{(1)*}, \theta^{(2)})
In other words, each player's cost is minimal with respect to that player's strategy.
Finding Nash equilibria ⊇ optimization (but this observation is not clearly useful).

Well-Studied Cases
- Finite minimax (zero-sum games)
- Finite mixed-strategy games
- Continuous, convex games
- Differential games (lion chases gladiator)

Continuous Minimax Game
The solution is a saddle point of the value function V. Not just any saddle point: it must specifically be a maximum for player 1 and a minimum for player 2.

Local Differential Nash Equilibria (Ratliff et al 2013)
\nabla_{\theta^{(i)}} J^{(i)}(\theta^{(1)}, \theta^{(2)}) = 0
Necessary: \nabla^2_{\theta^{(i)}} J^{(i)}(\theta^{(1)}, \theta^{(2)}) is positive semi-definite
Sufficient: \nabla^2_{\theta^{(i)}} J^{(i)}(\theta^{(1)}, \theta^{(2)}) is positive definite

Sufficient Condition for Simultaneous Gradient Descent to Converge (Ratliff et al 2013)
\omega = \begin{bmatrix} \nabla_{\theta^{(1)}} J^{(1)}(\theta^{(1)}, \theta^{(2)}) \\ \nabla_{\theta^{(2)}} J^{(2)}(\theta^{(1)}, \theta^{(2)}) \end{bmatrix}
The eigenvalues of \nabla \omega must have positive real part:
\nabla \omega = \begin{bmatrix} \nabla^2_{\theta^{(1)}} J^{(1)} & \nabla_{\theta^{(2)}} \nabla_{\theta^{(1)}} J^{(1)} \\ \nabla_{\theta^{(1)}} \nabla_{\theta^{(2)}} J^{(2)} & \nabla^2_{\theta^{(2)}} J^{(2)} \end{bmatrix}
(I call this the "generalized Hessian".)
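A minimal sketch of checking this condition (my own two-parameter game; the costs J1, J2 and the damping term eps are made up): form the Jacobian of omega by hand and inspect the real parts of its eigenvalues.

```python
import numpy as np

# Zero-sum game with a small quadratic term: J1(a, b) = a*b + 0.5*eps*a**2, J2 = -J1.
eps = 0.1
# omega = (dJ1/da, dJ2/db) = (b + eps*a, -a); its Jacobian with respect to (a, b):
H = np.array([[eps,  1.0],
              [-1.0, 0.0]])

eigvals = np.linalg.eigvals(H)
print(eigvals, "all positive real parts:", bool(np.all(eigvals.real > 0)))
# With eps = 0 (a purely bilinear game) the real parts are zero and simultaneous
# gradient descent cycles instead of converging.
```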

Interpretation
- Each player's Hessian should have large, positive eigenvalues, expressing a strong preference to keep doing their current strategy.
- The Jacobian of one player's gradient with respect to the other player's parameters should have smaller contributions to the eigenvalues, meaning each player has limited ability to change the other player's behavior at convergence.
- Does not apply to GANs, so their convergence remains an open question.

Equilibrium Finding Heuristics
Keep parameters near their running average:
- Periodically assign the running average value to the parameters
- Constrain parameters to lie near the running average
- Add a loss for deviation from the running average
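A minimal sketch of the running-average heuristic (hypothetical generator G reusing the toy architecture from the earlier sketches; decay value is my choice): maintain an exponential moving average of the parameters and either sample from the averaged copy or periodically copy it back into the live parameters.

```python
import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
G_avg = copy.deepcopy(G)
decay = 0.999

@torch.no_grad()
def update_average():
    # Exponential moving average of the live parameters.
    for p_avg, p in zip(G_avg.parameters(), G.parameters()):
        p_avg.mul_(decay).add_(p, alpha=1.0 - decay)

@torch.no_grad()
def assign_average_to_params():
    # "Periodically assign running average value to parameters."
    for p, p_avg in zip(G.parameters(), G_avg.parameters()):
        p.copy_(p_avg)

# Call update_average() after every optimizer step; optionally call
# assign_average_to_params() every so often, or simply sample from G_avg.
```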

Stabilized Training

Other Games in AI
- Robust optimization / robust control for security and safety, e.g. resisting adversarial examples
- Domain-adversarial learning for domain adaptation
- Adversarial privacy
- Guided cost learning
- Predictability minimization

Conclusion
- GANs are generative models that use supervised learning to approximate an intractable cost function
- GANs can simulate many cost functions, including the one used for maximum likelihood
- Finding Nash equilibria in high-dimensional, continuous, non-convex games is an important open research problem