Training an RBM: Contrastive Divergence
Sargur N. Srihari
srihari@cedar.buffalo.edu
Topics in Partition Function
Definition of Partition Function
1. The log-likelihood gradient
2. Stochastic maximum likelihood and contrastive divergence
3. Pseudolikelihood
4. Score matching and ratio matching
5. Denoising score matching
6. Noise-contrastive estimation
7. Estimating the partition function
RBM definition
- A joint configuration (v,h) of the visible and hidden units has an energy (Hopfield, 1982):
  E(v,h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}
- Stochastic binary pixels v are connected to stochastic binary feature detectors h using symmetrically weighted connections W, with biases a_i on the visible units and b_j on the hidden units.
  [Figure: RBM with visible binary units v, hidden binary units h, weights W, biases a and b]
- The network assigns a probability to every pair of visible and hidden vectors:
  p(v,h) = \frac{1}{Z} e^{-E(v,h)}
  where the partition function Z is a sum over all possible pairs of visible/hidden vectors:
  Z = \sum_{v,h} e^{-E(v,h)}
- The probability that the network assigns to a visible vector v is
  p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}
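Because Z sums over every visible/hidden configuration, a small RBM makes it computable by brute force. A minimal sketch (the network sizes and parameter values below are illustrative, not from the slides) that enumerates all (v,h) pairs, computes Z exactly, and checks that p(v,h) sums to one:

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    # E(v,h) = -a.v - b.h - v^T W h
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(0)
n_v, n_h = 3, 2  # tiny, so exhaustive enumeration is feasible
W = rng.normal(scale=0.1, size=(n_v, n_h))
a = rng.normal(scale=0.1, size=n_v)
b = rng.normal(scale=0.1, size=n_h)

# All 2^n_v visible states and 2^n_h hidden states
states_v = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_h)]

# Partition function: sum over all visible/hidden pairs
Z = sum(np.exp(-energy(v, h, W, a, b)) for v in states_v for h in states_h)

def p_joint(v, h):
    return np.exp(-energy(v, h, W, a, b)) / Z

def p_visible(v):
    # p(v) = (1/Z) sum_h exp(-E(v,h))
    return sum(p_joint(v, h) for h in states_h)

total = sum(p_joint(v, h) for v in states_v for h in states_h)
print(round(total, 6))  # 1.0 -- the joint probabilities sum to one
```

For realistic numbers of units the sum has exponentially many terms, which is exactly why the rest of the lecture is about avoiding this computation.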
Changing the probability of an image
- The probability the network assigns to a training image is raised by adjusting the weights and biases to lower the energy of that image and raise the energy of other images, especially those that have low energies and make a large contribution to the partition function.
- Maximum likelihood approach to determine the parameters W, a, b.
- Likelihood:
  P(\{v^{(1)},\dots,v^{(M)}\}) = \prod_m p(v^{(m)})
- Log-likelihood:
  \ln P(\{v^{(1)},\dots,v^{(M)}\}) = \sum_m \ln p(v^{(m)}) = \sum_m \left[ \ln \sum_h e^{-E(v^{(m)},h)} - \ln \sum_{v,h} e^{-E(v,h)} \right]
- Using \frac{d}{dx}\ln x = \frac{1}{x} and \frac{\partial E(v,h)}{\partial w_{ij}} = -v_i h_j, the derivative of the log-probability of a training vector with respect to a weight is
  \frac{\partial \ln p(v)}{\partial w_{ij}} = \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j]
- Learning rule for stochastic steepest ascent:
  \Delta w_{ij} = \varepsilon \left( \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \right)
  where \varepsilon is the learning rate.
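On an RBM small enough to enumerate, the gradient identity E_data[v_i h_j] - E_model[v_i h_j] can be verified numerically. A hedged sketch (sizes and values are illustrative): compute both expectation terms exactly and compare with a finite difference of ln p(v):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
a = rng.normal(scale=0.1, size=n_v)
b = rng.normal(scale=0.1, size=n_h)

states_v = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_h)]

def energy(v, h, W):
    return -(a @ v) - (b @ h) - (v @ W @ h)

def log_p(v, W):
    # ln p(v) = ln sum_h e^{-E(v,h)} - ln Z
    num = sum(np.exp(-energy(v, h, W)) for h in states_h)
    Z = sum(np.exp(-energy(vv, h, W)) for vv in states_v for h in states_h)
    return np.log(num) - np.log(Z)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([1.0, 0.0, 1.0])  # one training vector

# Data term: v_i * p(h_j=1|v), since E[h_j|v] = sigmoid(b_j + sum_i v_i w_ij)
data_term = np.outer(v, sigmoid(b + v @ W))

# Model term: exact expectation of v_i h_j under p(v,h)
Z = sum(np.exp(-energy(vv, h, W)) for vv in states_v for h in states_h)
model_term = sum(np.exp(-energy(vv, h, W)) / Z * np.outer(vv, h)
                 for vv in states_v for h in states_h)
grad = data_term - model_term

# Finite-difference check of d ln p(v) / d w_00
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
fd = (log_p(v, Wp) - log_p(v, Wm)) / (2 * eps)
print(abs(grad[0, 0] - fd) < 1e-5)  # True
```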
Samples for Computing Expectations
- Getting unbiased samples for E_data[v_i h_j]:
  Given a random training image v, the binary state h_j of each hidden unit is set to 1 with probability
  p(h_j = 1 \mid v) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)
  Given a hidden vector h, the binary state v_i of each visible unit is set to 1 with probability
  p(v_i = 1 \mid h) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)
- Getting unbiased samples for E_model[v_i h_j]:
  Can be done by starting at a random state of the visible units and performing Gibbs sampling for a long time.
  One iteration of alternating Gibbs sampling consists of updating all hidden units in parallel, followed by updating all visible units in parallel.
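The two conditionals above are all that one iteration of alternating Gibbs sampling needs. A minimal sketch (shapes and parameter values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One alternating Gibbs iteration: update all hidden units
    in parallel, then all visible units in parallel."""
    p_h = sigmoid(b + v @ W)              # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)              # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(scale=0.1, size=(n_v, n_h))
a = np.zeros(n_v)
b = np.zeros(n_h)

# Start at a random visible state and run the chain
v = rng.integers(0, 2, size=n_v).astype(float)
for _ in range(10):
    v, h = gibbs_step(v, W, a, b, rng)
print(v)  # a binary visible vector after 10 Gibbs iterations
```

Running the chain "for a long time" (many such iterations) yields approximately unbiased samples for the model expectation.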
Summary of RBM training
- Probability distribution of an undirected (Gibbs) model:
  p(x;\theta) = \frac{1}{Z(\theta)} \tilde{p}(x;\theta), \quad Z(\theta) = \sum_x \tilde{p}(x;\theta)
- For an RBM: x = \{v,h\}, \theta = \{W,a,b\}, with
  E(v,h) = -h^T W v - a^T v - b^T h = -\sum_{j,k} W_{j,k} h_j v_k - \sum_k a_k v_k - \sum_j b_j h_j
  p(v,h) = \frac{1}{Z} \exp(-E(v,h))
- Determine the parameters \theta that maximize the log-likelihood (negative loss):
  \max_\theta L(\{x^{(1)},\dots,x^{(m)}\};\theta), \quad L = \sum_i \log p(x^{(i)};\theta) = \sum_i \log \tilde{p}(x^{(i)};\theta) - m \log Z(\theta)
  The partition function Z(\theta) is intractable.
- For stochastic gradient ascent, take derivatives:
  g = \nabla_\theta \log p(x^{(i)};\theta) = \nabla_\theta \log \tilde{p}(x^{(i)};\theta) - \nabla_\theta \log Z(\theta), \quad \theta \leftarrow \theta + \varepsilon g
- Derivative of the positive phase:
  \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log \tilde{p}(x^{(i)};\theta)
  The summation is over m samples from the training set; since it is summed m times, the 1/m has no net effect.
- Derivative of the negative phase (an identity):
  \nabla_\theta \log Z(\theta) = \mathbb{E}_{x \sim p(x)}[\nabla_\theta \log \tilde{p}(x)] \approx \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log \tilde{p}(x^{(i)};\theta)
  Here the summation is over samples drawn from the RBM.
Stochastic Maximum Likelihood
- Naive implementation of the negative phase of training:
  \nabla_\theta \log Z(\theta) = \mathbb{E}_{x \sim p(x)}[\nabla_\theta \log \tilde{p}(x)] \approx \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log \tilde{p}(x^{(i)};\theta)
- Compute it by burning in a set of Markov chains from a random initialization every time the gradient is needed.
- When learning is performed by SGD, this means the chains must be burned in once per gradient step.
- This approach leads to the following training procedure.
Naive MCMC algorithm
- Maximizing the log-likelihood with an intractable partition function using gradient ascent:
- Positive phase: gradient of \log \tilde{p} evaluated on training samples.
- Negative phase: with random initialization, Gibbs samples from the model are used to determine the gradient update.
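The naive procedure can be sketched as a training loop in which every gradient step burns in fresh Markov chains from random states for the negative phase. This is a toy illustration only: the data, burn-in length, and learning rate below are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    # Batched alternating Gibbs: all hidden units, then all visible units
    p_h = sigmoid(b + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + h @ W.T)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

def naive_mcmc_update(data, W, a, b, eps, burn_in, rng):
    """One gradient-ascent step of naive stochastic maximum likelihood:
    negative-phase chains are burned in from RANDOM states every time."""
    # Positive phase: expectations under the data
    ph_data = sigmoid(b + data @ W)
    # Negative phase: fresh chains, burned in from random initialization
    v = (rng.random(data.shape) < 0.5).astype(float)
    for _ in range(burn_in):
        v, _ = gibbs_step(v, W, a, b, rng)
    ph_model = sigmoid(b + v @ W)
    m = len(data)
    W = W + eps * (data.T @ ph_data - v.T @ ph_model) / m
    a = a + eps * (data.mean(axis=0) - v.mean(axis=0))
    b = b + eps * (ph_data.mean(axis=0) - ph_model.mean(axis=0))
    return W, a, b

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W = 0.01 * rng.normal(size=(n_v, n_h))
a = np.zeros(n_v); b = np.zeros(n_h)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 5, dtype=float)

for _ in range(100):
    W, a, b = naive_mcmc_update(data, W, a, b, eps=0.05, burn_in=20, rng=rng)
print(W.shape)  # (4, 3)
```

Note how the burn-in loop runs inside every gradient step; that repeated cost is what the contrastive divergence algorithm later removes.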
MCMC: balance between forces
- The MCMC approach can be viewed as trying to balance two forces:
  - One pushing up on the model distribution where the data occurs.
  - Another pushing down on the model distribution where the model samples occur.
- The next figure illustrates this process.
- The two forces correspond to maximizing \log \tilde{p} and minimizing \log Z.
Phases of MCMC Algorithm
- Positive phase: we sample points from the data distribution and push up on their unnormalized probability.
- Negative phase: we sample points from the model distribution and push down on their unnormalized probability. This counteracts the positive phase's tendency to just add a constant to the unnormalized probability everywhere.
- When the data distribution and the model distribution are equal, the positive phase has the same chance of pushing up at a point as the negative phase has of pushing down. When this happens, there is no longer any gradient (in expectation) and training must terminate.
Interpreting the negative phase
- The negative phase involves drawing samples from the model's distribution.
- We can think of it as finding points that the model believes in strongly.
- Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model's incorrect beliefs about the world.
- They are referred to as "hallucinations" or "fantasy particles".
Neuroscientific analogy
- The negative phase has been proposed as an explanation of dreaming in humans and other animals:
  - The brain maintains a probabilistic model of the world.
  - It follows the gradient of \log \tilde{p} while experiencing real events while awake.
  - It follows the negative gradient of \log \tilde{p} to minimize \log Z while sleeping and experiencing events sampled from the present model.
- This has not been proven with neuroscientific experiments.
- In ML it is necessary to use the positive and negative phases simultaneously, rather than in separate periods of wakefulness and REM sleep.
A design less expensive than MCMC
- Given the positive/negative phases of learning, we can design a less expensive alternative to naive MCMC.
- The main cost of naive MCMC is the cost of burning in the Markov chains from a random initialization at each step.
- A natural solution is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn-in operation does not take many steps.
Contrastive Divergence algorithm
- Initializes the Markov chain at each step with samples from the data distribution.
- This is presented in the algorithm given next.
- Obtaining samples from the data distribution is free, because they are already in the data set.
- Initially the data distribution is not close to the model distribution, so the negative phase is inaccurate.
- The positive phase can still increase the model's probability of the data.
- After the positive phase has had time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.
Contrastive Divergence Algorithm
- Uses gradient ascent for optimization.
- Positive phase: as in the naive MCMC algorithm.
- Negative phase: Gibbs samples from the model, initialized with the training data, are used to determine the gradient update.
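The contrastive divergence step differs from the naive algorithm only in where the negative-phase chain starts: at the data, followed by a single Gibbs step (CD-1). A sketch on made-up toy data (all sizes and hyperparameters below are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(batch, W, a, b, eps, rng):
    """One CD-1 update: the Markov chain is initialized at the data
    (not at random), and a single Gibbs step gives the negative sample."""
    # Positive phase: hidden probabilities given the data
    ph0 = sigmoid(b + batch @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct visibles, then hidden probabilities
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Gradient-ascent update: data statistics minus reconstruction statistics
    m = len(batch)
    W += eps * (batch.T @ ph0 - v1.T @ ph1) / m
    a += eps * (batch - v1).mean(axis=0)
    b += eps * (ph0 - ph1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W = 0.01 * rng.normal(size=(n_v, n_h))
a = np.zeros(n_v); b = np.zeros(n_h)

# Toy data: two repeated binary patterns
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
for _ in range(500):
    W, a, b = cd1_step(data, W, a, b, eps=0.1, rng=rng)

# Mean squared reconstruction error (an untrained model gives ~0.25)
ph = sigmoid(b + data @ W)
recon = sigmoid(a + ph @ W.T)
err = np.mean((data - recon) ** 2)
print(round(err, 3))
```

Because the chain starts at the data, no burn-in from random states is needed, which is the cost saving motivated on the previous slides.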