Chapter 20. Deep Generative Models


Peng et al.: Deep Learning and Practice

Generative Models

Models that are able to:
- provide an estimate of the probability distribution function $p_{\text{data}}$, or
- generate samples from a (likely implicit) distribution

Why Study Generative Models?

- Manipulation of high-dimensional, multi-modal distributions
- Potential uses in reinforcement learning, such as future-state prediction
- Training with missing data (e.g., missing labels) and prediction on such data
- Generation of realistic samples
- etc.

Taxonomy of Generative Models

- Explicit density $p_{\text{model}}(x; \theta)$
  - Tractable (trained with ordinary maximum likelihood)
  - Intractable/approximate (trained with approximate inference and/or MCMC approximations)
- Implicit density
  - Single-step sample generation via a network
  - Multi-step sample generation via Markov chains

Generative Adversarial Networks (GAN)

A differentiable generator network G, paired with a discriminator D for training:
- The generator G maps latent noise $z \sim p(z)$ to visible variables $x$; conceptually, it is a graphical model with the same structure as a VAE
- $x = G(z)$ can be regarded as a sample drawn from some $p_g(x)$
- The generator is what we are ultimately concerned with

The discriminator D divides inputs into real and fake classes:
- An ordinary binary classifier, trained in a supervised fashion
- Inputs are training examples (real) and generated samples (fake)

Training GANs: A Two-Player Minimax Game

$D(x; \theta^{(D)})$ and $G(z; \theta^{(G)})$ are implemented with neural networks, and each player has its own cost to minimize.

Discriminator cost (cross-entropy cost):
$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \mathbb{E}_{z \sim p_z} \log(1 - D(G(z)))$
where $D(x)$ denotes the probability of $x$ being real.

Generator cost:
$J^{(G)}(\theta^{(D)}, \theta^{(G)}) = -J^{(D)}(\theta^{(D)}, \theta^{(G)})$

Note that the sum of all players' costs is zero (a zero-sum game).
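As a concrete illustration (a minimal sketch, not the lecture's code), the two costs can be estimated from a minibatch of discriminator outputs; d_real and d_fake below are assumed to be the sigmoid outputs $D(x)$ on real examples and $D(G(z))$ on generated samples:

    import numpy as np

    def discriminator_cost(d_real, d_fake, eps=1e-12):
        # J^(D) = -E_x[log D(x)] - E_z[log(1 - D(G(z)))], estimated on a minibatch
        return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

    def generator_cost(d_real, d_fake, eps=1e-12):
        # Zero-sum game: J^(G) = -J^(D)
        return -discriminator_cost(d_real, d_fake, eps)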

The entire game can be summarized with a value function
$V(\theta^{(D)}, \theta^{(G)}) = -J^{(D)}(\theta^{(D)}, \theta^{(G)})$
and the objective is to find a generator
$\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})$

Optimization vs. Game

The solution to an optimization problem is generally a local minimum of an objective function in parameter space, e.g.
$\arg\min_{\theta^{(G)}, \theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})$
where both $\theta^{(G)}$ and $\theta^{(D)}$ are optimized simultaneously.

The solution to a game is generally a saddle point of an objective function in parameter space, e.g.
$\arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})$
where $\theta^{(G)}$ and $\theta^{(D)}$ are optimized in turn, controlling one of them at a time with the other held fixed (see the sketch below).
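In code, the alternating scheme looks roughly like the following sketch; sample_data, sample_noise, grad_V_wrt_D, and grad_V_wrt_G are hypothetical helpers (not defined in the lecture) that draw minibatches and return gradients of V with respect to the corresponding parameters:

    def minimax_training(theta_D, theta_G, sample_data, sample_noise,
                         grad_V_wrt_D, grad_V_wrt_G, lr=1e-4, n_steps=10000, k=1):
        for _ in range(n_steps):
            # Inner problem: k gradient-ascent steps on V w.r.t. the discriminator
            for _ in range(k):
                x, z = sample_data(), sample_noise()
                theta_D = theta_D + lr * grad_V_wrt_D(theta_D, theta_G, x, z)
            # Outer problem: one gradient-descent step on V w.r.t. the generator
            z = sample_noise()
            theta_G = theta_G - lr * grad_V_wrt_G(theta_D, theta_G, z)
        return theta_D, theta_G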

The Optimal Discriminator

For a given generator G, the optimal discriminator is seen to be
$D^*_G(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$
which can be obtained by setting $\frac{\delta}{\delta D(x)} J^{(D)} = 0$.

Given enough capacity, the discriminator therefore obtains an estimate of the ratio $\frac{p_{\text{data}}(x)}{p_g(x)}$ at every $x$; this is a key property that sets GANs apart from many other generative models.
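The derivation behind this result is a pointwise optimization; a minimal sketch, writing $a = p_{\text{data}}(x)$ and $b = p_g(x)$ for a fixed $x$: the discriminator chooses $D(x) \in (0, 1)$ to minimize $f(D) = -a \log D - b \log(1 - D)$. Setting
$f'(D) = -\frac{a}{D} + \frac{b}{1 - D} = 0$
gives $D = \frac{a}{a + b}$, i.e. $D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$; the second derivative $f''(D) = \frac{a}{D^2} + \frac{b}{(1 - D)^2} > 0$ confirms this is a minimum of $J^{(D)}$ (equivalently, a maximum of $V$).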

The generator learns its model by following the discriminator uphill: it moves its samples toward regions that the discriminator rates as more likely to be real.

The Optimal Generator

Given $D^*_G(x)$ and enough capacity, the optimal generator minimizes the Jensen-Shannon divergence between $p_{\text{data}}$ and $p_g$:

$\arg\min_{p_g} \; \mathbb{E}_{x \sim p_{\text{data}}} \log D^*_G(x) + \mathbb{E}_{x \sim p_g} \log(1 - D^*_G(x))$
$= \arg\min_{p_g} \; \mathbb{E}_{x \sim p_{\text{data}}} \log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} + \mathbb{E}_{x \sim p_g} \log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}$
$= \arg\min_{p_g} \; -\log 4 + \mathrm{KL}\left(p_{\text{data}} \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right) + \mathrm{KL}\left(p_g \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right)$
$= \arg\min_{p_g} \; -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$

The minimum is achieved when $p_g = p_{\text{data}}$, i.e. $\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) = 0$.

Remarks

- The optimization above is carried out w.r.t. $p_g$ directly, whereas the analysis for the discriminator is carried out w.r.t. $D(x)$.
- "Enough capacity" in both contexts means that $D^*_G(x)$ and $p_g(x)$ can be realized by $D(x; \theta^{(D)})$ and $G(z; \theta^{(G)})$, respectively.

Implementation

(Convergence) If G and D have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum $D^*_G(x)$ given G, and $p_g$ is then updated so as to improve the criterion (reduce the cost)
$\mathbb{E}_{x \sim p_{\text{data}}} \log D^*_G(x) + \mathbb{E}_{x \sim p_g} \log(1 - D^*_G(x))$
then $p_g$ converges to $p_{\text{data}}$.

Nothing is said, however, about convergence when the optimization is instead performed by simultaneous stochastic gradient descent in parameter space.

Non-Convergence of Gradient Descent: A Toy Problem

$\min_x \max_y V(x, y) = xy$

x and y are optimized by gradient descent/ascent with a tiny learning rate:
$x(t + \Delta t) = x(t) - \Delta t \, \frac{\partial V}{\partial x}(x(t), y(t))$
$y(t + \Delta t) = y(t) + \Delta t \, \frac{\partial V}{\partial y}(x(t), y(t))$

In the limit of infinitesimal steps, this amounts to solving the ODE system
$x'(t) = -y(t), \quad y'(t) = x(t) \;\Rightarrow\; x''(t) = -x(t)$

which has a solution of the form
$x(t) = x(0)\cos(t) - y(0)\sin(t)$
$y(t) = x(0)\sin(t) + y(0)\cos(t)$
i.e. the trajectory $(x(t), y(t))$ circles the equilibrium $(x^*, y^*) = (0, 0)$ of the value surface $V(x, y) = xy$ forever instead of converging to it, as the short simulation below illustrates.
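A few lines of NumPy reproduce this behaviour (a minimal sketch; the learning rate plays the role of $\Delta t$):

    import numpy as np

    x, y = 1.0, 1.0              # initial point (x(0), y(0))
    lr = 0.01                    # tiny learning rate, i.e. the step size dt
    for t in range(2000):
        gx, gy = y, x            # dV/dx = y and dV/dy = x for V(x, y) = x * y
        x, y = x - lr * gx, y + lr * gy   # simultaneous descent on x, ascent on y
    # The radius sqrt(x^2 + y^2) does not shrink (it even grows slightly because of
    # the discrete steps): the iterates orbit the saddle point (0, 0) instead of
    # converging to it.
    print(x, y, np.hypot(x, y))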

Other Games

The zero-sum game does not perform well for learning the generator: the gradient of $J^{(G)}$ vanishes (saturates) when the discriminator performs well, i.e. when it confidently rejects generated samples and $D(G(z)) \approx 0$.

- Heuristic, non-saturating game (to ensure non-vanishing gradients; see the numerical check below):
  $J^{(G)} = -\mathbb{E}_z \log D(G(z))$
- Maximum likelihood game (to minimize the KL divergence):
  $J^{(G)} = -\mathbb{E}_z \exp\!\left(\sigma^{-1}(D(G(z)))\right)$
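A quick numerical check of the saturation issue (a sketch, expressed in terms of the discriminator's logit $a$, so that $D(G(z)) = \sigma(a)$): when the discriminator confidently rejects generated samples ($a \ll 0$), the gradient of the minimax cost term vanishes while that of the non-saturating cost does not:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Discriminator logits for generated samples; very negative = confident "fake"
    a = np.array([-8.0, -4.0, -2.0, 0.0])

    # d/da [ log(1 - sigmoid(a)) ] = -sigmoid(a)       (minimax generator cost term)
    grad_minimax = -sigmoid(a)
    # d/da [ -log sigmoid(a) ]     = sigmoid(a) - 1    (non-saturating generator cost)
    grad_nonsat = sigmoid(a) - 1.0

    print(grad_minimax)  # ~[-3.4e-04, -1.8e-02, -1.2e-01, -5.0e-01]  -> vanishes
    print(grad_nonsat)   # ~[-1.00,    -0.98,    -0.88,    -0.50   ]  -> stays useful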

Mode Collapse Problem

The generator learns to map different z to the same x.

(Figure: top row shows the data distribution, a mixture of Gaussians; bottom row shows the learned generator distribution over time. The generator produces only a single mode at a time and does not converge in this example.)

This behaviour is acceptable in some applications, but not in all.

Learned Representation

The generator can learn a distributed representation that disentangles high-level concepts, e.g. gender vs. wearing glasses.

DCGAN

There are many different generator implementations, such as DCGAN, LAPGAN, and more (left for self-study).

Deep Boltzmann Machines (DBM)

An energy-based generative model with an explicit density over binary visible variables $v$ and hidden variables $h^{(1)}, h^{(2)}, h^{(3)}$:
$p(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = \frac{1}{Z(\theta)} \exp\!\left(-E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta)\right)$
where
$E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^{T} W^{(1)} h^{(1)} - h^{(1)T} W^{(2)} h^{(2)} - h^{(2)T} W^{(3)} h^{(3)}$
and $\theta = \{W^{(1)}, W^{(2)}, W^{(3)}\}$. Note that bias terms are omitted for simplicity.
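A minimal sketch of the energy and the unnormalized probability, assuming binary NumPy vectors and weight matrices of compatible shapes (names are illustrative, not from the lecture):

    import numpy as np

    def dbm_energy(v, h1, h2, h3, W1, W2, W3):
        # E(v, h1, h2, h3) = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3   (biases omitted)
        return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)

    def unnormalized_prob(v, h1, h2, h3, W1, W2, W3):
        # p(v, h1, h2, h3) = exp(-E) / Z(theta); Z itself is intractable to compute
        return np.exp(-dbm_energy(v, h1, h2, h3, W1, W2, W3))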

Graphical model for a DBM: the odd layers can be separated from the even layers to reveal a bipartite structure. As a result, variables in the odd layers are conditionally independent given the even layers, and vice versa; this enables block Gibbs sampling.

Likewise, variables within a layer are conditionally independent given the neighbouring layers. In the case of two hidden layers, we have
$p(v_i = 1 \mid h^{(1)}) = \sigma(W^{(1)}_{i,:} h^{(1)})$
$p(h^{(1)}_i = 1 \mid v, h^{(2)}) = \sigma(v^{T} W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)})$
$p(h^{(2)}_i = 1 \mid h^{(1)}) = \sigma(h^{(1)T} W^{(2)}_{:,i})$

However, the posterior distribution over all hidden layers given the visible layer does not factorize, because of the interactions between layers:
$p(h^{(1)}, h^{(2)} \mid v) \neq \prod_j p(h^{(1)}_j \mid v) \prod_k p(h^{(2)}_k \mid v)$
Approximate inference therefore needs to be sought.
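One block Gibbs sweep for the two-hidden-layer case can be sketched as follows (a minimal sketch, assuming v of shape (n_v,), W1 of shape (n_v, n_h1), and W2 of shape (n_h1, n_h2)); the even layers v and h2 are sampled independently given the odd layer h1:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sample_bernoulli(p, rng):
        return (rng.random(p.shape) < p).astype(float)

    def block_gibbs_step(v, h1, h2, W1, W2, rng):
        # Odd layer given the even layers: p(h1 | v, h2)
        h1 = sample_bernoulli(sigmoid(v @ W1 + h2 @ W2.T), rng)
        # Even layers given the odd layer: p(v | h1) and p(h2 | h1), independently
        v = sample_bernoulli(sigmoid(h1 @ W1.T), rng)
        h2 = sample_bernoulli(sigmoid(h1 @ W2), rng)
        return v, h1, h2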

DBM Mean-Field Inference

Construct a factorial $Q(h \mid v)$ to approximate $p(h \mid v) = p(h^{(1)}, h^{(2)} \mid v)$:
$Q(h \mid v) = \prod_j q(h^{(1)}_j \mid v) \prod_k q(h^{(2)}_k \mid v)$

In the present case all hidden variables $h^{(1)}_j, h^{(2)}_k$ are binary, so each factor must take the functional form of a Bernoulli distribution, i.e.
$q(h^{(1)}_j \mid v) = (\hat{h}^{(1)}_j)^{h^{(1)}_j} (1 - \hat{h}^{(1)}_j)^{1 - h^{(1)}_j}, \quad \forall j$
$q(h^{(2)}_k \mid v) = (\hat{h}^{(2)}_k)^{h^{(2)}_k} (1 - \hat{h}^{(2)}_k)^{1 - h^{(2)}_k}, \quad \forall k$
where $\hat{h}^{(1)}_j, \hat{h}^{(2)}_k \in [0, 1]$ are the corresponding variational parameters.

Carrying out the expectation (this needs some work)
$q_j(h_j \mid v) \propto \exp\!\left(\mathbb{E}_{q_{-j}}\!\left[\log p(v, h^{(1)}, h^{(2)}; \theta)\right]\right)$
yields the following fixed-point update equations:
$\hat{h}^{(1)}_j = \sigma\!\left(\sum_i v_i W^{(1)}_{i,j} + \sum_k W^{(2)}_{j,k} \hat{h}^{(2)}_k\right), \quad \forall j$
$\hat{h}^{(2)}_k = \sigma\!\left(\sum_j W^{(2)}_{j,k} \hat{h}^{(1)}_j\right), \quad \forall k$
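These updates translate directly into a fixed-point iteration; a minimal sketch that simply alternates the two equations for a fixed number of sweeps:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mean_field(v, W1, W2, n_iters=10):
        # Variational Bernoulli means, initialized at 0.5
        h1_hat = np.full(W1.shape[1], 0.5)
        h2_hat = np.full(W2.shape[1], 0.5)
        for _ in range(n_iters):
            h1_hat = sigmoid(v @ W1 + h2_hat @ W2.T)   # update for hat{h}^(1)
            h2_hat = sigmoid(h1_hat @ W2)              # update for hat{h}^(2)
        return h1_hat, h2_hat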

DBM Parameter Learning

DBM learning has to confront both the intractable posterior $p(h \mid v)$ and the intractable partition function $Z(\theta)$, so a combination of variational inference, learning, and MCMC is necessary. The objective then becomes to find $W^{(1)}, W^{(2)}$ that maximize the variational lower bound
$\mathcal{L}(Q, \theta) = \sum_i \sum_j v_i W^{(1)}_{i,j} \hat{h}^{(1)}_j + \sum_j \sum_k \hat{h}^{(1)}_j W^{(2)}_{j,k} \hat{h}^{(2)}_k - \log Z(\theta) + \mathcal{H}(Q)$
which can be done via stochastic gradient ascent (study Algorithm 20.1; a rough sketch of one update appears below):
$\theta \leftarrow \theta + \varepsilon \, \nabla_\theta \mathcal{L}(Q, \theta)$
In general, layer-wise pre-training is needed to arrive at a good model.
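A rough sketch of one stochastic update (not Algorithm 20.1 itself), reusing the mean_field and block_gibbs_step helpers sketched above: the positive phase uses mean-field expectations given the data, and the negative phase uses samples from a persistent block Gibbs chain to approximate the gradient of $\log Z(\theta)$:

    import numpy as np

    def dbm_learning_step(v_batch, W1, W2, chain, rng, lr=1e-3, k=5):
        # Positive phase: expectations under Q(h | v) for each data vector
        pos_W1 = np.zeros_like(W1)
        pos_W2 = np.zeros_like(W2)
        for v in v_batch:
            h1_hat, h2_hat = mean_field(v, W1, W2)
            pos_W1 += np.outer(v, h1_hat)
            pos_W2 += np.outer(h1_hat, h2_hat)
        pos_W1 /= len(v_batch)
        pos_W2 /= len(v_batch)

        # Negative phase: model expectations from a persistent Gibbs chain
        v_m, h1_m, h2_m = chain
        for _ in range(k):
            v_m, h1_m, h2_m = block_gibbs_step(v_m, h1_m, h2_m, W1, W2, rng)
        neg_W1 = np.outer(v_m, h1_m)
        neg_W2 = np.outer(h1_m, h2_m)

        # Gradient ascent on the variational lower bound L(Q, theta)
        W1 = W1 + lr * (pos_W1 - neg_W1)
        W2 = W2 + lr * (pos_W2 - neg_W2)
        return W1, W2, (v_m, h1_m, h2_m)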

Topics Not Covered

- Optimization for training deep models (Chapter 8)
- Representation learning (Chapter 15)
- Back-propagation through random operations (REINFORCE, Chapter 20)
- Boltzmann machines for real-valued data (Chapter 20)
- Generative stochastic networks (Chapter 20)
- Deep belief networks (Chapter 20)
- Other generative models (Chapter 20)