Training an RBM: Contrastive Divergence. Sargur N. Srihari, srihari@cedar.buffalo.edu


Topics in Partition Function
Definition of the partition function
1. The log-likelihood gradient
2. Stochastic maximum likelihood and contrastive divergence
3. Pseudolikelihood
4. Score matching and ratio matching
5. Denoising score matching
6. Noise-contrastive estimation
7. Estimating the partition function

RBM definition
A joint configuration (v,h) of the visible and hidden units has an energy (Hopfield 1982):
E(v,h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}
[Figure: RBM with visible binary units v (stochastic binary pixels) connected to hidden binary units h (stochastic binary feature detectors) by symmetrically weighted connections W, with visible biases a_i and hidden biases b_j]
The network assigns a probability to every pair of visible and hidden vectors:
p(v,h) = (1/Z) e^{-E(v,h)}
where the partition function Z is a sum over all possible pairs of visible/hidden vectors:
Z = \sum_{v,h} e^{-E(v,h)}
The probability that the network assigns to a visible vector v is:
p(v) = (1/Z) \sum_h e^{-E(v,h)}
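As a concrete illustration (not part of the original slides), the following NumPy sketch evaluates E(v,h), the partition function Z by brute-force enumeration, and p(v) for a tiny RBM; the sizes and random parameter values are assumptions chosen only so that the enumeration is feasible.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3                       # tiny model so Z is enumerable
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)                          # visible biases a_i
b = np.zeros(n_hidden)                           # hidden biases b_j

def energy(v, h):
    # E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij
    return -(a @ v) - (b @ h) - v @ W @ h

def all_states(n):
    # all binary configurations of n units
    return itertools.product([0, 1], repeat=n)

# Partition function: Z = sum over all visible/hidden configurations of e^{-E(v,h)}
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in all_states(n_visible) for h in all_states(n_hidden))

def p_v(v):
    # p(v) = (1/Z) * sum_h e^{-E(v,h)}
    return sum(np.exp(-energy(v, np.array(h))) for h in all_states(n_hidden)) / Z

# Sanity check: p(v) summed over all visible configurations is 1
print(sum(p_v(np.array(v)) for v in all_states(n_visible)))
print(p_v(np.array([1, 0, 1, 0])))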

Changing the probability of an image
The probability the network assigns to a training image is raised by adjusting the weights and biases to lower the energy of that image and raise the energy of other images, especially those that have low energies and make a large contribution to the partition function.
Maximum likelihood approach to determine the parameters W, a, b:
Likelihood: P({v^{(1)},...,v^{(M)}}) = \prod_{m=1}^{M} p(v^{(m)})
Log-likelihood:
ln P({v^{(1)},...,v^{(M)}}) = \sum_m ln p(v^{(m)}) = \sum_m ln [ (1/Z) \sum_h e^{-E(v^{(m)},h)} ] = \sum_m [ ln \sum_h e^{-E(v^{(m)},h)} - ln \sum_{v,h} e^{-E(v,h)} ]
Using d/dx ln x = 1/x and \partial E(v,h)/\partial w_{ij} = -v_i h_j, the derivative of the log-probability of a training vector with respect to a weight is:
\partial ln p(v) / \partial w_{ij} = E_{data}[v_i h_j] - E_{model}[v_i h_j]
Learning rule for stochastic steepest ascent:
\Delta w_{ij} = \epsilon ( E_{data}[v_i h_j] - E_{model}[v_i h_j] )
where \epsilon is the learning rate.
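The learning rule above is easy to state in code. Below is a minimal NumPy sketch (my own illustration, not from the slides) in which v_data/h_data are a minibatch of visible vectors with their sampled hidden states and v_model/h_model are samples from the model; how the model samples are obtained is the topic of the following slides.

import numpy as np

def weight_update(v_data, h_data, v_model, h_model, epsilon=0.01):
    """Delta w_ij = epsilon * ( E_data[v_i h_j] - E_model[v_i h_j] ).

    Each argument has shape (batch_size, n_units); the outer products are
    averaged over the batch to estimate the two expectations.
    """
    pos = v_data.T @ h_data / len(v_data)        # <v_i h_j> under the data
    neg = v_model.T @ h_model / len(v_model)     # <v_i h_j> under the model
    return epsilon * (pos - neg)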

Samples for Computing Expectations
Getting unbiased samples for E_{data}[v_i h_j]:
Given a random training image v, the binary state h_j of each hidden unit is set to 1 with probability
p(h_j = 1 | v) = \sigma( b_j + \sum_i v_i w_{ij} )
Given a hidden vector h, the binary state v_i of each visible unit is set to 1 with probability
p(v_i = 1 | h) = \sigma( a_i + \sum_j h_j w_{ij} )
Getting unbiased samples for E_{model}[v_i h_j]:
Can be done by starting at a random state of the visible units and performing Gibbs sampling for a long time. One iteration of alternating Gibbs sampling consists of updating all the hidden units in parallel, followed by updating all the visible units in parallel.
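A sketch of these conditionals and of one iteration of alternating Gibbs sampling, using the parameter naming from the slides (W of shape visible x hidden, visible biases a, hidden biases b); the helper names are my own.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b, rng):
    # p(h_j = 1 | v) = sigma( b_j + sum_i v_i w_ij )
    p = sigmoid(v @ W + b)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, a, rng):
    # p(v_i = 1 | h) = sigma( a_i + sum_j h_j w_ij )
    p = sigmoid(h @ W.T + a)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_step(v, W, a, b, rng):
    # One iteration of alternating Gibbs sampling: update all hidden units
    # in parallel, then all visible units in parallel.
    h, _ = sample_h_given_v(v, W, b, rng)
    v_new, _ = sample_v_given_h(h, W, a, rng)
    return v_new, h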

Summary of RBM training
Probability distribution of an undirected model (Gibbs distribution):
p(x; \theta) = (1/Z(\theta)) \tilde{p}(x; \theta),    Z(\theta) = \sum_x \tilde{p}(x; \theta)
For an RBM: x = {v, h}, \theta = {W, a, b}
E(v,h) = -h^T W v - a^T v - b^T h = -\sum_{j,k} W_{j,k} h_j v_k - \sum_k a_k v_k - \sum_j b_j h_j
p(v,h) = (1/Z) exp(-E(v,h)),    \partial E(v,h)/\partial W_{j,k} = -h_j v_k
[Figure: RBM with visible binary units v, hidden binary units h, connection weights W, biases a_i and b_j]
Determine the parameters \theta that maximize the log-likelihood (the negative of the loss):
max_\theta L({x^{(1)},...,x^{(m)}}; \theta)
L({x^{(1)},...,x^{(m)}}; \theta) = \sum_i log p(x^{(i)}; \theta) = \sum_i [ log \tilde{p}(x^{(i)}; \theta) - log Z(\theta) ]
The partition function Z(\theta) is intractable.
For stochastic gradient ascent, take derivatives:
g = \nabla_\theta [ log \tilde{p}(x^{(i)}; \theta) - log Z(\theta) ],    \theta \leftarrow \theta + \epsilon g
Derivative of the positive phase:
\nabla_\theta (1/m) \sum_{i=1}^{m} log \tilde{p}(x^{(i)}; \theta)
The summation is over samples from the training set; the 1/m factor simply averages over them.
Derivative of the negative phase (using the identity below):
\nabla_\theta log Z(\theta) = E_{x \sim p(x)} [ \nabla_\theta log \tilde{p}(x) ] \approx (1/m) \sum_{i=1}^{m} \nabla_\theta log \tilde{p}(x^{(i)}; \theta)
Here the summation is over samples drawn from the RBM (the model distribution).

Stochastic Maximum Likelihood
A naive implementation of the negative phase of training evaluates
\nabla_\theta log Z(\theta) = E_{x \sim p(x)} [ \nabla_\theta log \tilde{p}(x) ] \approx (1/m) \sum_{i=1}^{m} \nabla_\theta log \tilde{p}(x^{(i)}; \theta)
by burning in a set of Markov chains from a random initialization every time the gradient is needed. When learning is performed by SGD, this means the chains must be burned in once per gradient step. This approach leads to the following training procedure.

Naive MCMC algorithm
Maximizing the log-likelihood with an intractable partition function using gradient ascent.
Positive phase.
Negative phase: with a random initialization, Gibbs samples from the model are used to determine the gradient update.
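A sketch of this naive procedure for a single gradient step, reusing the hypothetical helpers sample_h_given_v, sample_v_given_h and weight_update from the earlier sketches (so this block is not standalone); the burn-in length and learning rate are arbitrary assumptions, and only the weight update is shown (bias updates follow the same data-minus-model pattern).

def naive_mcmc_gradient_step(v_data, W, a, b, rng, burn_in=1000, epsilon=0.01):
    # Positive phase: hidden states driven by the training data
    h_data, _ = sample_h_given_v(v_data, W, b, rng)

    # Negative phase: burn in fresh Markov chains from a random initialization
    # every time the gradient is needed (this is what makes the method expensive)
    v_model = (rng.random(v_data.shape) < 0.5).astype(float)
    for _ in range(burn_in):
        h_model, _ = sample_h_given_v(v_model, W, b, rng)
        v_model, _ = sample_v_given_h(h_model, W, a, rng)
    h_model, _ = sample_h_given_v(v_model, W, b, rng)

    # Gradient ascent update on the weights
    return W + weight_update(v_data, h_data, v_model, h_model, epsilon)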

MCMC: balance between forces
We can view the MCMC approach as trying to balance two forces: one pushing up on the model distribution where the data occurs, and another pushing down on the model distribution where the model samples occur. The next figure illustrates this process. The two forces correspond to maximizing log \tilde{p} and minimizing log Z.

Phases of the MCMC Algorithm
Positive phase: we sample points from the data distribution and push up on their unnormalized probability.
Negative phase: we sample points from the model distribution and push down on their unnormalized probability. This counteracts the positive phase's tendency to just add a constant to the unnormalized probability everywhere.
When the data distribution and the model distribution are equal, the positive phase has the same chance of pushing up at a point as the negative phase has of pushing down. When this happens, there is no longer any gradient (in expectation) and training must terminate.

Interpreting the negative phase
The negative phase involves drawing samples from the model's distribution. We can think of it as finding points that the model believes in strongly. Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model's incorrect beliefs about the world. They are referred to as "hallucinations" or "fantasy particles".

Neuroscientific analogy
The negative phase has been proposed as an explanation of dreaming in humans and other animals: the brain maintains a probabilistic model of the world, follows the gradient of log \tilde{p} while experiencing real events when awake, and follows the negative gradient of log \tilde{p}, to minimize log Z, while sleeping and experiencing events sampled from the current model. This has not been proven with neuroscientific experiments. In ML it is necessary to use the positive and negative phases simultaneously, rather than in separate periods of wakefulness and REM sleep.

A design less expensive than MCMC
Given the positive/negative phase view of learning, we can design a less expensive alternative to naive MCMC. The main cost of naive MCMC is the cost of burning in the Markov chains from a random initialization at each step. The natural solution is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn-in operation does not take as many steps.

Contrastive Divergence algorithm
Contrastive divergence initializes the Markov chain at each step with samples from the data distribution. This is presented in the algorithm given next. Obtaining samples from the data distribution is free, because they are already in the data set. Initially the data distribution is not close to the model distribution, so the negative phase is inaccurate, but the positive phase can still increase the model's probability of the data. After the positive phase has had time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.

Contrastive Divergence Algorithm
Using gradient ascent for optimization.
Positive phase.
Negative phase: Gibbs samples from the model, initialized with the training data, are used to determine the gradient update.
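For comparison with the naive MCMC sketch earlier, here is CD-k in the same hypothetical notation (again reusing sample_h_given_v, sample_v_given_h and weight_update, so it is not standalone): the only change is that the negative-phase chain is initialized at the training data and run for just k Gibbs steps instead of a long burn-in from a random state.

def contrastive_divergence_step(v_data, W, a, b, rng, k=1, epsilon=0.01):
    # Positive phase: hidden states driven by the training data
    h_data, _ = sample_h_given_v(v_data, W, b, rng)

    # Negative phase: initialize the chain at the data (cheap: the samples are
    # already in the data set) and run only k steps of Gibbs sampling
    v_model = v_data.copy()
    for _ in range(k):
        h_model, _ = sample_h_given_v(v_model, W, b, rng)
        v_model, _ = sample_v_given_h(h_model, W, a, rng)
    h_model, _ = sample_h_given_v(v_model, W, b, rng)

    # Gradient ascent update: Delta W = eps * ( <v h>_data - <v h>_model );
    # only the weight update is shown, bias updates are analogous
    return W + weight_update(v_data, h_data, v_model, h_model, epsilon)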