Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Summary

Story so far:
- Representation: latent variable models vs. fully observed models
- Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods

Plan for today: energy-based models

Likelihood-based learning

Probability distributions $p(x)$ are a key building block in generative modeling. Properties:
1. Non-negative: $p(x) \geq 0$
2. Sum-to-one: $\sum_x p(x) = 1$ (or $\int p(x)\,dx = 1$ for continuous variables)

Sum-to-one is key: the total volume is fixed, so increasing $p(x_{\text{train}})$ guarantees that $x_{\text{train}}$ becomes relatively more likely (compared to the rest).

Parameterizing probability distributions

Probability distributions $p(x)$ are a key building block in generative modeling. Properties:
1. Non-negative: $p(x) \geq 0$
2. Sum-to-one: $\sum_x p(x) = 1$ (or $\int p(x)\,dx = 1$ for continuous variables)

Coming up with a non-negative function $g_\theta(x)$ is not hard. For example:
- $g_\theta(x) = f_\theta(x)^2$, where $f_\theta$ is any neural network
- $g_\theta(x) = \exp(f_\theta(x))$, where $f_\theta$ is any neural network

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not sum to one. In general $\sum_x g_\theta(x) = Z(\theta) \neq 1$, so $g_\theta(x)$ is not a valid probability mass function or density.
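
As a minimal sketch (my own illustration, not from the lecture), the snippet below builds an unnormalized $g_\theta(x) = \exp(f_\theta(x))$ using a tiny made-up two-layer network for $f_\theta$ (the weights W1, b1, w2 are hypothetical); it is non-negative by construction, but nothing makes it integrate to one.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)   # hypothetical parameters theta
w2 = rng.normal(size=16)

def f_theta(x):
    """Scalar score f_theta(x) produced by a small neural network."""
    h = np.tanh(W1 @ x + b1)
    return w2 @ h

def g_theta(x):
    """Non-negative by construction, but unnormalized: no guarantee it integrates to 1."""
    return np.exp(f_theta(x))

x = np.array([0.5, -1.0])
print(g_theta(x))   # always >= 0, but Z(theta) = integral of g_theta is unknown
```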

Likelihood-based learning

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not be normalized.

Solution: $p_\theta(x) = \frac{g_\theta(x)}{\text{Volume}(g_\theta)} = \frac{g_\theta(x)}{\int g_\theta(x)\,dx}$. Then by definition, $\int p_\theta(x)\,dx = 1$.

Typically, choose $g_\theta(x)$ so that we know the volume analytically as a function of $\theta$. For example:
1. $g_{(\mu,\sigma)}(x) = e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. The volume is $\int e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \sqrt{2\pi\sigma^2}$.
2. $g_\lambda(x) = e^{-\lambda x}$. The volume is $\int_0^{+\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda}$.
3. Etc.

We can only choose functional forms $g_\theta(x)$ that we can integrate analytically. This is very restrictive, but as we have seen, they are very useful as building blocks.
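
As a quick numerical check (my own, not part of the slides), the crude Riemann sums below confirm the two analytic volumes quoted above for arbitrary made-up values of $\mu$, $\sigma$, and $\lambda$.

```python
import numpy as np

mu, sigma, lam = 0.0, 1.5, 2.0        # made-up parameter values

xs = np.linspace(-50, 50, 2_000_001)
dx = xs[1] - xs[0]
gauss_vol = np.sum(np.exp(-(xs - mu) ** 2 / (2 * sigma ** 2))) * dx
print(gauss_vol, np.sqrt(2 * np.pi * sigma ** 2))   # both ~3.7599

ts = np.linspace(0, 50, 2_000_001)
dt = ts[1] - ts[0]
expo_vol = np.sum(np.exp(-lam * ts)) * dt
print(expo_vol, 1 / lam)                            # both ~0.5
```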

Likelihood-based learning

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not be normalized.

Solution: $p_\theta(x) = \frac{g_\theta(x)}{\text{Volume}(g_\theta)} = \frac{g_\theta(x)}{\int g_\theta(x)\,dx}$. Typically, choose $g_\theta(x)$ so that we know the volume analytically.

More complex models can be obtained by combining these building blocks. Two main strategies (see the sketch below):
1. Products of normalized objects $p_\theta(x)\, p_{\theta'}(y)$:
   $\int_x \int_y p_\theta(x)\, p_{\theta'}(y)\, dx\, dy = \int_x p_\theta(x) \left( \int_y p_{\theta'}(y)\, dy \right) dx = \int_x p_\theta(x)\, dx = 1$, since the inner integral equals 1.
2. Mixtures of normalized objects $\alpha\, p_\theta(x) + (1-\alpha)\, p_{\theta'}(x)$:
   $\int_x \alpha\, p_\theta(x) + (1-\alpha)\, p_{\theta'}(x)\, dx = \alpha + (1-\alpha) = 1$

How about using models where the volume / normalization constant is not easy to compute analytically?
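
A tiny sanity check (my own, with made-up Gaussian components): the slide's product is over *different* variables and stays normalized, and a mixture over the same variable is normalized too; the check also shows that a product of two densities over the *same* variable is generally not normalized, which is exactly the situation the following slides turn to.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

xs = np.linspace(-60, 60, 1_000_001)
dx = xs[1] - xs[0]
alpha = 0.3
mixture = alpha * normal_pdf(xs, -2.0, 1.0) + (1 - alpha) * normal_pdf(xs, 3.0, 2.0)
product = normal_pdf(xs, -2.0, 1.0) * normal_pdf(xs, 3.0, 2.0)
print(np.sum(mixture) * dx)   # ~1.0: mixtures of normalized densities stay normalized
print(np.sum(product) * dx)   # != 1 in general: a product over the same x needs its own Z
```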

Energy-based model

$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

The volume/normalization constant $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is also called the partition function.

Why exponential (and not e.g. $f_\theta(x)^2$)?
1. We want to capture very large variations in probability; log-probability is the natural scale we want to work with. Otherwise we would need a highly non-smooth $f_\theta$.
2. Exponential families: many common distributions can be written in this form.
3. These distributions arise under fairly general assumptions in statistical physics (maximum entropy, second law of thermodynamics).

$-f_\theta(x)$ is called the energy, hence the name. Intuitively, configurations $x$ with low energy (high $f_\theta(x)$) are more likely.
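
As a minimal illustration (my own, with made-up parameters), the snippet below turns an arbitrary score $f_\theta$ into a valid distribution over a tiny discrete space, where the partition function $Z(\theta)$ can still be computed by summing over all $2^3$ states.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=3)                         # hypothetical parameters

def f_theta(x):
    """Any scalar score works; here a made-up linear function of a 3-bit vector."""
    return theta @ x - 0.5 * np.sum(x)

states = np.array([[int(b) for b in np.binary_repr(i, width=3)] for i in range(8)])
scores = np.array([f_theta(x) for x in states])
Z = np.sum(np.exp(scores))                         # partition function: 2^3 terms here
p = np.exp(scores) / Z
print(p, p.sum())                                  # a valid pmf: non-negative, sums to 1
```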

Energy-based model

$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

Pros:
1. Can plug in pretty much any function $f_\theta(x)$ you want.

Cons (lots of them):
1. Sampling from $p_\theta(x)$ is hard.
2. Evaluating and optimizing the likelihood $p_\theta(x)$ is hard (learning is hard).
3. No feature learning (but can add latent variables).

Curse of dimensionality: the fundamental issue is that computing $Z(\theta)$ numerically (when no analytic solution is available) scales exponentially in the number of dimensions of $x$.

Nevertheless, some tasks do not require knowing $Z(\theta)$.

Applications of energy-based models

$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

Given $x$ and $x'$, evaluating $p_\theta(x)$ or $p_\theta(x')$ requires $Z(\theta)$. However, their ratio
$\frac{p_\theta(x)}{p_\theta(x')} = \exp(f_\theta(x) - f_\theta(x'))$
does not involve $Z(\theta)$. This means we can easily check which one is more likely. Applications:
1. Anomaly detection
2. Denoising
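
A minimal sketch (my own, with a hypothetical quadratic $f_\theta$) of the point above: comparing two inputs needs only the difference of their scores, never $Z(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 4))                 # made-up parameters

def f_theta(x):
    return -0.5 * x @ W @ x                 # any scalar-valued network would do here

x, x_prime = rng.normal(size=4), rng.normal(size=4)
ratio = np.exp(f_theta(x) - f_theta(x_prime))
print("p(x)/p(x') =", ratio)                # > 1 means x is the more likely of the two
```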

Applications of energy-based models

Given a trained model, many applications require only relative comparisons. Hence $Z(\theta)$ is not needed.

Example: Ising model

There is a true image $y \in \{0,1\}^{3\times 3}$ and a corrupted image $x \in \{0,1\}^{3\times 3}$. We know $x$ and want to somehow recover $y$. We model the joint probability distribution $p(y, x)$ as

$p(y, x) = \frac{1}{Z} \exp\left( \sum_i \psi_i(x_i, y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j) \right)$

- $\psi_i(x_i, y_i)$: the $i$-th corrupted pixel depends on the $i$-th original pixel
- $\psi_{ij}(y_i, y_j)$: neighboring pixels tend to have the same value

What did the original image $y$ look like? Solution: maximize $p(y \mid x)$.
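
A toy sketch (my own, with made-up potential strengths ALPHA and BETA, not the lecture's notation): since $y$ has only $2^9 = 512$ configurations on a 3x3 grid, we can find $\arg\max_y p(y \mid x)$ by brute force, and because $x$ is fixed this is the same as maximizing the unnormalized score, so $Z$ never appears.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
y_true = (rng.random((3, 3)) < 0.5).astype(int)
flip = rng.random((3, 3)) < 0.1                    # corrupt ~10% of the pixels
x = np.where(flip, 1 - y_true, y_true)

ALPHA, BETA = 2.0, 1.0                             # hypothetical potential strengths

def score(y, x):
    """Unnormalized log p(y, x): agreement with x plus agreement between grid neighbors."""
    s = ALPHA * np.sum(y == x)
    s += BETA * np.sum(y[:, :-1] == y[:, 1:])      # horizontal edges
    s += BETA * np.sum(y[:-1, :] == y[1:, :])      # vertical edges
    return s

best = max((np.array(bits).reshape(3, 3) for bits in product([0, 1], repeat=9)),
           key=lambda y: score(y, x))
print("corrupted:\n", x, "\nMAP estimate:\n", best)
```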

Example: product of experts

Suppose you have trained several models $q_{\theta_1}(x)$, $r_{\theta_2}(x)$, $t_{\theta_3}(x)$. They can be different models (PixelCNN, flow, etc.). Each one is like an expert that can be used to score how likely an input $x$ is.

Assuming the experts make their judgments independently, it is tempting to ensemble them as $q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)$. To get a valid probability distribution, we need to normalize:

$p_{\theta_1, \theta_2, \theta_3}(x) = \frac{1}{Z(\theta_1, \theta_2, \theta_3)}\, q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)$

Note: this is similar to an AND operation (e.g., the probability is zero as long as one model gives zero probability), unlike mixture models, which behave more like OR.
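
A small discrete illustration (my own, with two made-up experts): multiply the experts' probabilities pointwise and renormalize with a single brute-force $Z$; an input that any one expert rejects gets zero probability (AND), while a mixture would still allow it (OR).

```python
import numpy as np

xs = np.arange(10)
expert_q = np.exp(-0.5 * (xs - 3) ** 2)
expert_q /= expert_q.sum()             # normalized expert that likes x ~ 3
expert_r = np.exp(-0.5 * (xs - 5) ** 2)
expert_r[0] = 0.0                      # this expert assigns zero probability to x = 0
expert_r /= expert_r.sum()             # normalized expert that likes x ~ 5

product = expert_q * expert_r
poe = product / product.sum()          # Z(theta_1, theta_2) computed by brute force
mixture = 0.5 * expert_q + 0.5 * expert_r

print(poe[0], mixture[0])              # PoE: exactly 0 (AND); mixture: > 0 (OR)
print(xs[np.argmax(poe)])              # PoE concentrates between the experts
```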

Example: Restricted Boltzmann Machine (RBM)

RBM: an energy-based model with latent variables. Two types of variables:
1. $x \in \{0,1\}^n$ are visible variables (e.g., pixel values)
2. $z \in \{0,1\}^m$ are latent ones

The joint distribution is
$p_{W,b,c}(x, z) = \frac{1}{Z} \exp\left( x^T W z + b x + c z \right) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{m} x_i z_j w_{ij} + b x + c z \right)$

"Restricted" because there are no visible-visible or hidden-hidden connections, i.e., no $x_i x_j$ or $z_i z_j$ terms in the objective.
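
A minimal sketch (my own, with random made-up parameters) of the RBM's unnormalized score $\exp(x^T W z + b x + c z)$: evaluating it for one $(x, z)$ pair is easy; the hard part, as the next slides show, is $Z(W, b, c)$.

```python
import numpy as np

n, m = 4, 3
rng = np.random.default_rng(4)
W, b, c = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=m)

def unnormalized(x, z):
    """exp(x^T W z + b.x + c.z); note there are no x_i x_j or z_i z_j terms."""
    return np.exp(x @ W @ z + b @ x + c @ z)

x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=m)
print(unnormalized(x, z))      # easy to evaluate pointwise; Z(W, b, c) is the hard part
```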

Applications: Deep Boltzmann Machines

Stacked RBMs are one of the first deep generative models:
- Bottom-layer variables $v$ are pixel values.
- Layers above ($h$) represent higher-level features (corners, edges, etc.).

Early deep neural networks for supervised learning had to be pre-trained like this to make them work.

Applications: Boltzmann Machines

Energy-based models: learning and inference

$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

Pros:
1. Can plug in pretty much any function $f_\theta(x)$ you want.

Cons (lots of them):
1. Sampling is hard.
2. Evaluating the likelihood (learning) is hard.
3. No feature learning.

Curse of dimensionality: the fundamental issue is that computing $Z(\theta)$ numerically (when no analytic solution is available) scales exponentially in the number of dimensions of $x$.

Computing the normalization constant is hard

As an example, the RBM joint distribution is
$p_{W,b,c}(x, z) = \frac{1}{Z} \exp\left( x^T W z + b x + c z \right)$
where:
1. $x \in \{0,1\}^n$ are visible variables (e.g., pixel values)
2. $z \in \{0,1\}^m$ are latent ones

The normalization constant (the "volume") is
$Z(W, b, c) = \sum_{x \in \{0,1\}^n} \sum_{z \in \{0,1\}^m} \exp\left( x^T W z + b x + c z \right)$

Note: it is a well-defined function of the parameters $W, b, c$, just hard to compute: it takes time exponential in $n, m$. This means that evaluating the objective $p_{W,b,c}(x, z)$ for likelihood-based learning is hard. Optimizing the un-normalized probability $\exp\left( x^T W z + b x + c z \right)$ with respect to the trainable parameters $W, b, c$ is easy, but optimizing the likelihood $p_{W,b,c}(x, z)$ is hard because of $Z(W, b, c)$.
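
To make the exponential cost concrete, here is a brute-force evaluation (my own sketch) of $Z(W, b, c)$ for a tiny RBM; the double sum already has $2^4 \cdot 2^3 = 128$ terms and would be hopeless at realistic sizes.

```python
import numpy as np
from itertools import product

n, m = 4, 3
rng = np.random.default_rng(5)
W, b, c = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=m)

Z = 0.0
for x_bits in product([0, 1], repeat=n):          # 2^n visible configurations
    for z_bits in product([0, 1], repeat=m):      # 2^m hidden configurations
        x, z = np.array(x_bits), np.array(z_bits)
        Z += np.exp(x @ W @ z + b @ x + c @ z)
print(Z)   # 128 terms for n=4, m=3; for n=m=100 it would be 2^200 terms
```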

Training intuition

Goal: maximize $\frac{\exp(f_\theta(x_{\text{train}}))}{Z(\theta)}$, i.e., increase the numerator and decrease the denominator.

Intuition: because the model is not normalized, increasing the un-normalized probability $\exp(f_\theta(x_{\text{train}}))$ by changing $\theta$ does not guarantee that $x_{\text{train}}$ becomes relatively more likely (compared to the rest). We also need to take into account the effect on other "wrong" points and try to push them down, to also make $Z(\theta)$ small.

Contrastive divergence

Goal: maximize $\frac{\exp(f_\theta(x_{\text{train}}))}{Z(\theta)}$.

Idea: instead of evaluating $Z(\theta)$ exactly, use a Monte Carlo estimate.

Contrastive divergence algorithm: sample $x_{\text{sample}} \sim p_\theta$, then take a gradient step on $\nabla_\theta \left( f_\theta(x_{\text{train}}) - f_\theta(x_{\text{sample}}) \right)$. Make the training data more likely than a typical sample from the model. Recall that comparisons are easy in energy-based models!

Looks simple, but how to sample? Unfortunately, sampling is hard.
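
A bare-bones sketch (my own simplification, not the lecture's code) of one contrastive-divergence-style update, assuming a linear energy $f_\theta(x) = \theta \cdot \phi(x)$ with a hypothetical feature map $\phi$, so the gradient of $f_\theta(x_{\text{train}}) - f_\theta(x_{\text{sample}})$ is just $\phi(x_{\text{train}}) - \phi(x_{\text{sample}})$; in practice $x_{\text{sample}}$ would come from the MCMC procedure on the next slide.

```python
import numpy as np

rng = np.random.default_rng(6)

def phi(x):
    """Hypothetical feature map for a binary vector x: singletons plus pairwise products."""
    return np.concatenate([x, np.outer(x, x)[np.triu_indices(len(x), k=1)]])

n = 5
theta = np.zeros(phi(np.zeros(n)).shape)
lr = 0.1

def cd_step(theta, x_train, x_sample):
    # Push up the score of the data point, push down the score of the model sample.
    return theta + lr * (phi(x_train) - phi(x_sample))

x_train = rng.integers(0, 2, size=n)            # one data point
x_sample = rng.integers(0, 2, size=n)           # stand-in for a sample from p_theta
theta = cd_step(theta, x_train, x_sample)       # real CD would draw x_sample via MCMC
print(theta[:n])
```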

Sampling from energy-based models

$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

There is no direct way to sample as in autoregressive or flow models. Main issue: we cannot easily compute how likely each possible sample is. However, we can easily compare two samples $x$, $x'$.

Use an iterative approach called Markov chain Monte Carlo:
1. Initialize $x^0$ randomly, set $t = 0$.
2. Let $x' = x^t + \text{noise}$.
   - If $f_\theta(x') > f_\theta(x^t)$, let $x^{t+1} = x'$.
   - Else let $x^{t+1} = x'$ with probability $\exp(f_\theta(x') - f_\theta(x^t))$, otherwise keep $x^{t+1} = x^t$.
3. Go to step 2.

This works in theory, but can take a very long time to converge.
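
A short sketch (my own) of the Metropolis-style procedure above, with a made-up quadratic score $f_\theta$ chosen so that the stationary distribution is a standard 2-D Gaussian and the sample statistics are easy to check.

```python
import numpy as np

rng = np.random.default_rng(7)

def f_theta(x):
    return -0.5 * np.sum(x ** 2)            # log of an unnormalized standard Gaussian

def mcmc_sample(n_steps=10_000, noise_scale=0.5):
    x = rng.normal(size=2)                  # step 1: random initialization
    samples = []
    for _ in range(n_steps):
        x_prop = x + noise_scale * rng.normal(size=2)     # step 2: perturb current state
        # Accept if the score improves, else accept with probability exp(f(x') - f(x)).
        if f_theta(x_prop) > f_theta(x) or rng.random() < np.exp(f_theta(x_prop) - f_theta(x)):
            x = x_prop
        samples.append(x)
    return np.array(samples)

samples = mcmc_sample()
print(samples.mean(axis=0), samples.std(axis=0))   # should approach [0, 0] and [1, 1]
```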

Conclusion

Energy-based models are another useful tool for modeling high-dimensional probability distributions: a very flexible class of models, currently less popular because of computational issues.

Energy-based GANs: the energy is represented by a discriminator; contrastive samples (as in contrastive divergence) come from a GAN-style generator.

Reference: LeCun et al., A Tutorial on Energy-Based Learning [Link]