Contrastive Divergence
Training Products of Experts by Minimizing Contrastive Divergence (Hinton, 2002)
Helmut Puhr
Institute for Theoretical Computer Science, TU Graz
June 9, 2010
Contents
1 Theory
2 Argument
3 Contrastive divergence
4 Applications
5 Summary
Definitions
Training data: X = \{x_1, \ldots, x_K\}
Model parameters: \Theta, where
p(x \mid \Theta) = \frac{1}{Z(\Theta)} f(x \mid \Theta)   (1)
Z(\Theta) = \int f(x \mid \Theta) \, dx   (2)
Estimating model parameters
Find \Theta that maximises the probability of the training data
p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k \mid \Theta)   (3)
or equivalently minimise the negative log-probability, denoted the energy function,
E(X; \Theta) = -\frac{1}{K} \log p(X \mid \Theta) = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta)   (4)
Model function: Gaussian
Choose the probability model function as the PDF of a normal distribution
f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma)   (5)
so that \Theta = \{\mu, \sigma\}
\partial E(X; \Theta)/\partial \mu = 0: the optimal \mu is the mean of the training data
\partial E(X; \Theta)/\partial \sigma = 0: the optimal \sigma^2 is the variance of the training data
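As a quick numerical check of these closed-form optima, a minimal NumPy sketch (the toy data and variable names are my own, not from the talk):

import numpy as np

# Toy 1-D training data (not from the talk), drawn from N(2.0, 0.5^2).
x = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=10_000)

# Setting dE/dmu = 0 and dE/dsigma = 0 yields the sample statistics:
mu_hat = x.mean()          # optimal mu: the mean of the training data
sigma_hat = x.std()        # optimal sigma: the standard deviation of the data
print(mu_hat, sigma_hat)   # approximately 2.0 and 0.5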
Model function: Mixture of Gaussians
Choose the probability model function as a sum of N normal distributions, so that \Theta = \{\mu_1, \ldots, \mu_N, \sigma_1, \ldots, \sigma_N\}
f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i)   (6)
\log Z(\Theta) = \log N   (7)
\partial E(X; \Theta)/\partial \Theta_i depends on the other parameters
Use expectation maximisation or gradient ascent (a minimal EM sketch follows below)
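A minimal EM sketch for this equal-weight 1-D mixture, matching the slide's parameterisation \Theta = \{\mu_i, \sigma_i\} (function name and initialisation scheme are my own assumptions):

import numpy as np

def em_gmm_1d(x, n_components, n_iters=100, seed=0):
    # Fit a 1-D mixture of N equally weighted Gaussians with EM.
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=n_components, replace=False)
    sigma = np.full(n_components, x.std())
    for _ in range(n_iters):
        # E-step: responsibility r[k, i] of component i for data point x_k
        dens = np.stack([np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                         for m, s in zip(mu, sigma)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from its responsibility-weighted data
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return mu, sigma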
Model function: Product of Gaussians
Choose the probability model function as a product of N normal distributions
f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i)   (8)
Z(\Theta) is no longer a constant
Z(\Theta) = \int f(x \mid \Theta) \, dx is not tractable
Numerical integration of E(X; \Theta) is too costly
Why use a Product of Gaussians?
Mixture models:
- very inefficient in high-dimensional spaces
- the posterior distribution cannot be sharper than the individual models
- so the individual models have to be broadly tuned
Product of Gaussians:
- cannot approximate arbitrary smooth distributions unless each individual model contains at least one latent variable
- each expert can pose a constraint
- derivatives are hard to calculate
Contrastive divergence
- minimise an energy function that cannot be evaluated directly
- use CD to estimate the gradient of the energy function
- find a minimum by taking small steps in the direction of steepest descent
CD: Energy function
\frac{\partial E(X; \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta}   (9)
= \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_X   (10)
where \langle \cdot \rangle_X denotes the expectation of \cdot over the data X
CD: Derivation of \partial \log Z(\Theta)/\partial \Theta (full derivation in Proof 1)
\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta}   (11)
= \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta) \, dx   (12)
= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)}   (13)
CD: Sampling
- approximate the derivative of \log Z(\Theta) by drawing samples from p(x \mid \Theta)
- samples cannot be drawn directly (Z(\Theta) is unknown)
- instead use MCMC (e.g. Gibbs) sampling
Gibbs Sampling
- special case of the Metropolis-Hastings algorithm
- it is simpler to sample from a conditional distribution than to marginalise by integrating over a joint distribution
- e.g. to sample from p(x, y), start with y_0 and for i = 1, 2, \ldots
  x_i \sim p(x \mid y = y_{i-1})
  y_i \sim p(y \mid x = x_i)
(a minimal sketch follows below)
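A minimal sketch of this scheme for a 2-D standard Gaussian with correlation rho, where both conditionals are univariate Gaussians (example of my own, not from the talk):

import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    # Gibbs sampling from a 2-D standard Gaussian with correlation rho:
    # x_i ~ p(x | y = y_{i-1}) = N(rho * y_{i-1}, 1 - rho^2)
    # y_i ~ p(y | x = x_i)     = N(rho * x_i,     1 - rho^2)
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, 2))
    y = 0.0                               # arbitrary starting value y_0
    cond_std = np.sqrt(1.0 - rho ** 2)
    for i in range(n_samples):
        x = rng.normal(rho * y, cond_std)
        y = rng.normal(rho * x, cond_std)
        samples[i] = (x, y)
    return samples

print(np.corrcoef(gibbs_bivariate_gaussian(0.8, 50_000).T)[0, 1])  # roughly 0.8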
CD: MCMC Sampling
- use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution
- given the data, all hidden states can be updated in parallel (conditional independence)
CD: Gibbs sampling
- time 0: clamp the visible variables to a data vector; update all hidden variables with samples from their posterior distribution given the visible variables
- time 1: update the visible variables to reproduce the original data vector (the reconstruction), then update all hidden variables again
(a minimal RBM sketch of one such cycle follows below)
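For a binary RBM this alternating update can be written compactly; a minimal sketch of one cycle, with assumed parameter names W (weights), b (visible biases), c (hidden biases):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_gibbs_cycle(v0, W, b, c, rng):
    # time 0: sample all hidden units in parallel from their posterior,
    # given the clamped data vector(s) v0
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # time 1: reconstruct the visible units, then resample all hidden units
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + c)
    return h0, v1, p_h1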
CD: Energy function 2
X^n \ldots training data transformed by n cycles of MCMC, with X^0 \equiv X
Substituting leads to
\frac{\partial E(X; \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^n} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0}   (14)
(exact in the limit n \to \infty)
CD: MCMC sampling length
- many cycles of MCMC sampling are still too costly
- Hinton's intuition: a few MCMC cycles ought to be enough
- after a few iterations the data has already started to move from the target distribution towards the proposed distribution
- empirically, 1 cycle suffices
CD-1
To minimise the energy function, update
\Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^1} \right)   (15)
with step size \eta (chosen experimentally)
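For the binary RBM case discussed below, a minimal sketch of this update (eq. 15) using the usual positive-minus-negative statistics; the parameter names W, b, c and the helper function are my own assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(v0, W, b, c, eta, rng):
    # Positive phase: statistics <.>_{X^0} taken at the data v0.
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: statistics <.>_{X^1} after one Gibbs reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + c)
    # Gradient estimate <vh>_{X^0} - <vh>_{X^1}, averaged over the batch.
    n = v0.shape[0]
    W_new = W + eta * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_new = b + eta * (v0 - v1).mean(axis=0)
    c_new = c + eta * (p_h0 - p_h1).mean(axis=0)
    return W_new, b_new, c_new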
Simple example
- 15 "unigauss" experts (each expert mixes a Gaussian with a uniform)
- given a data vector, calculate the posterior probability of selecting the Gaussian or the uniform for each expert; calculate \langle \cdot \rangle_{X^0}
- stochastically select the Gaussian or the uniform for each expert according to this posterior
- compute the normalised product of the selected Gaussians and sample from it to get a reconstructed vector in data space
- calculate \langle \cdot \rangle_{X^1}
Simple example 2
Figure: each dot is a data point; the data is fitted with 15 unigauss experts; the ellipses show the standard deviation of each expert's Gaussian.
RBM with CD: digits
- 500 hidden units
- 16x16 visible units
- pixel intensities in [0, 1]
- 8000 examples
- weights of 100 hidden units shown (figure)
- almost perfect reconstructions
(a minimal training-loop sketch follows below)
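A self-contained CD-1 training-loop sketch matching these dimensions (500 hidden, 16x16 = 256 visible units); random binary data stands in for the real digit images, and the hyperparameters are my own guesses:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
data = (rng.random((8000, 256)) < 0.3).astype(float)  # placeholder for digit data

n_visible, n_hidden, eta = 256, 500, 0.05
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)   # visible biases
c = np.zeros(n_hidden)    # hidden biases

for epoch in range(5):
    for start in range(0, len(data), 100):             # mini-batches of 100
        v0 = data[start:start + 100]
        p_h0 = sigmoid(v0 @ W + c)                      # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b)                    # one-step reconstruction
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + c)                      # negative phase
        W += eta * (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)
        b += eta * (v0 - v1).mean(axis=0)
        c += eta * (p_h0 - p_h1).mean(axis=0)
    recon = sigmoid(sigmoid(data[:100] @ W + c) @ W.T + b)
    print(f"epoch {epoch}: reconstruction error {np.mean((data[:100] - recon) ** 2):.4f}")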
Summary
- PoE may lead to better approximations than mixture models
- the learning gradient is intractable to calculate exactly; it is estimated via CD
- the procedure is similar to RBM learning
Thank you for your attention!
References:
- Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 2002.
- Oliver Woodford. Notes on Contrastive Divergence.
Proof 1
\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta) \, dx   (16)
= \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta} \, dx   (17)
= \frac{1}{Z(\Theta)} \int f(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \, dx   (18)
= \int p(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \, dx   (19)
= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)}   (20)