Contrastive Divergence

1 Contrastive Divergence
Training Products of Experts by Minimizing CD (Hinton, 2002)
Helmut Puhr, Institute for Theoretical Computer Science, TU Graz
June 9, 2010

2 Contents
1. Theory
2. Argument
3. Contrastive divergence
4. Applications
5. Summary

3 Definitions
- Training data: $X = x_1, \dots, x_K$
- Model parameters: $\Theta$

where

$$p(x \mid \Theta) = \frac{1}{Z(\Theta)} f(x \mid \Theta) \tag{1}$$

$$Z(\Theta) = \int f(x \mid \Theta)\, dx \tag{2}$$

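4 Estimating model parameters
Find the $\Theta$ that maximises the probability of the training data

$$p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k \mid \Theta) \tag{3}$$

or equivalently, that minimises the negative log of $p(X \mid \Theta)$, averaged over the $K$ data points (denoted the energy function)

$$E(X \mid \Theta) = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta) \tag{4}$$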

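5 Model function: Gaussian
Choose the PDF of a normal distribution as the probability model function

$$f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma) \tag{5}$$

so that $\Theta = \{\mu, \sigma\}$.
- $\partial E(X \mid \Theta) / \partial \mu = 0$: the optimal $\mu$ is the mean of the training data
- $\partial E(X \mid \Theta) / \partial \sigma = 0$: the optimal $\sigma$ is $\sqrt{\sigma^2}$, the square root of the variance of the training data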
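The Gaussian case can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the function name `energy` and the sample parameters are illustrative and not from the slides. It evaluates the energy (4) with $\log Z(\Theta) = 0$, which holds because the normal PDF is already normalised, and compares the sample mean and standard deviation against nearby parameter values.

```python
import numpy as np

def energy(data, mu, sigma):
    """Energy (4) for a normalised Gaussian model, where log Z = 0."""
    log_f = -0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return -np.mean(log_f)

# Synthetic training data (illustrative values only).
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, size=1000)

mu_hat = data.mean()    # candidate optimum for mu
sigma_hat = data.std()  # square root of the (biased) sample variance

e_opt = energy(data, mu_hat, sigma_hat)
print(e_opt, energy(data, mu_hat + 0.1, sigma_hat), energy(data, mu_hat, 1.1 * sigma_hat))
```

Perturbing either parameter away from the sample statistics should never lower the energy, which is what the slide's two optimality conditions state.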

6 Model function: Mixture of Gaussians
Choose a sum of $N$ normal distributions as the probability model function

$$f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{6}$$

so that $\Theta = \{\mu_1, \dots, \mu_N, \sigma_1, \dots, \sigma_N\}$.

$$\log Z(\Theta) = \log N \tag{7}$$

- $\partial E(X \mid \Theta) / \partial \Theta_i$ depends on the other parameters
- Use expectation maximisation or gradient ascent

7 Model function: Product of Gaussians
Choose a product of $N$ normal distributions as the probability model function

$$f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{8}$$

- $Z(\Theta)$ is no longer a constant
- $Z(\Theta) = \int f(x \mid \Theta)\, dx$ is not tractable
- Numerical integration of $E(X \mid \Theta)$ is too costly
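That $Z(\Theta)$ depends on the parameters can be seen in a one-dimensional toy case. The sketch below assumes NumPy; the names `normal_pdf` and `product_partition` and the grid settings are illustrative. It integrates a product of two Gaussians on a grid for two settings of the means. In one dimension this integration is cheap and even has a closed form, so the example only illustrates the parameter dependence; the cost concern on the slide applies to high-dimensional products of experts.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """PDF of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def product_partition(mus, sigmas, lo=-30.0, hi=30.0, n=200001):
    """Riemann-sum estimate of Z = integral of prod_i N(x | mu_i, sigma_i) dx."""
    x = np.linspace(lo, hi, n)
    f = np.ones_like(x)
    for mu, sigma in zip(mus, sigmas):
        f *= normal_pdf(x, mu, sigma)
    return float(np.sum(f) * (x[1] - x[0]))

z_close = product_partition([0.0, 0.5], [1.0, 1.0])  # experts nearly agree
z_far = product_partition([0.0, 3.0], [1.0, 1.0])    # experts disagree
print(z_close, z_far)
```

For two Gaussians the integral equals $\mathcal{N}(\mu_1 \mid \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$, so moving the means apart shrinks $Z(\Theta)$.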

8 Why use a Product of Gaussians?
Mixture models
- are very inefficient in high-dimensional spaces
- the posterior distribution cannot be sharper than the individual models
- but the individual models have to be broadly tuned

Product of Gaussians
- cannot approximate arbitrary smooth distributions
- a product of experts can, if each individual model contains at least one latent variable
- each expert poses a constraint
- derivatives are hard to calculate

9 Contrastive divergence
- Minimize an energy function that cannot be evaluated
- Use CD to estimate the gradient of the energy function
- Find a minimum by taking small steps in the direction of steepest descent

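10 CD: Energy function

$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta} \tag{9}$$

$$= \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X} \tag{10}$$

where $\langle \cdot \rangle_X$ denotes the expectation of $\cdot$ given the data $X$.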

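11 CD: Derivation of $\partial \log Z(\Theta) / \partial \Theta$ (see 4)

$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta} \tag{11}$$

$$= \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{12}$$

$$\vdots$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{13}$$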

12 CD: Sampling
- Approximate the derivative of $\log Z(\Theta)$ by drawing samples from $p(x \mid \Theta)$
- Samples cannot be drawn directly ($Z(\Theta)$ is unknown)
- Instead, use MCMC (e.g. Gibbs) sampling

13 Gibbs Sampling
- Special case of the Metropolis-Hastings algorithm
- Simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution
- E.g. to sample $p(x, y)$, start with $y_0$ and, for $i = 1, 2, \dots$, draw

$$x_i \sim p(x \mid y = y_{i-1})$$

$$y_i \sim p(y \mid x = x_i)$$
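The alternating scheme can be sketched for a case where both conditionals are known in closed form. The example below assumes NumPy and picks a standard bivariate normal with correlation `rho` as the target; that choice, and the function name `gibbs_bivariate_normal`, are illustrative and not from the slides. For this target, $x \mid y \sim \mathcal{N}(\rho y, 1 - \rho^2)$ and symmetrically for $y \mid x$.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, y0=0.0, seed=0):
    """Gibbs sampling from a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_std = np.sqrt(1.0 - rho ** 2)  # std of each conditional distribution
    samples = np.empty((n_samples, 2))
    y = y0
    for i in range(n_samples):
        x = rng.normal(rho * y, cond_std)  # x_i ~ p(x | y = y_{i-1})
        y = rng.normal(rho * x, cond_std)  # y_i ~ p(y | x = x_i)
        samples[i] = (x, y)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
# Discard an initial burn-in before estimating the correlation.
print(np.corrcoef(samples[1000:].T)[0, 1])
```

The empirical correlation of the retained samples should be close to the chosen `rho`, although successive samples are correlated and the estimate converges more slowly than with independent draws.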

14 CD: MCMC Sampling
- Use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution
- Given the data, all hidden states can be updated in parallel (conditional independence)

15 CD: Gibbs sampling
- Time 0: all hidden variables are updated with samples from their posterior distribution given the visible variables
- Time 1: the visible variables are updated to produce a reconstruction of the original data vector, then all hidden variables are updated again
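One such cycle can be sketched for a restricted Boltzmann machine. The slide does not fix the model, so the code assumes binary visible and hidden units with logistic conditionals; the names `gibbs_cycle`, `W`, `b_vis` and `b_hid` are hypothetical. Because the hidden units are conditionally independent given the visible ones (and vice versa), each layer is sampled in a single vectorised step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_cycle(v0, W, b_vis, b_hid, rng):
    """One Gibbs cycle in a binary RBM (assumed model): data -> hidden -> reconstruction -> hidden."""
    # Time 0: sample all hidden units in parallel given the data vectors.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Time 1: sample a reconstruction of the visible units given the hidden states.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Then update all hidden units again, now given the reconstruction.
    p_h1 = sigmoid(v1 @ W + b_hid)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    return h0, v1, h1

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((6, 4))           # 6 visible, 4 hidden units
v0 = (rng.random((5, 6)) < 0.5).astype(float)   # 5 random binary data vectors
h0, v1, h1 = gibbs_cycle(v0, W, np.zeros(6), np.zeros(4), rng)
print(h0.shape, v1.shape, h1.shape)
```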

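16 CD: Energy function 2
$X^n$ denotes the transformed training data (after $n$ cycles of MCMC), such that $X^0 \equiv X$.
Substituting leads to

$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^\infty} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} \tag{14}$$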
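The substitution can be spelled out step by step. This sketch only combines equations (10) and (13) with the assumption, implicit in the slide, that after infinitely many MCMC cycles the transformed data $X^\infty$ is distributed as $p(x \mid \Theta)$:

```latex
\begin{align*}
\frac{\partial E(X \mid \Theta)}{\partial \Theta}
  &= \frac{\partial \log Z(\Theta)}{\partial \Theta}
     - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X}
     && \text{by (10)} \\
  &= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)}
     - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X}
     && \text{by (13)} \\
  &= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^\infty}
     - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0}
     && \text{since } X^\infty \sim p(x \mid \Theta),\ X^0 \equiv X
\end{align*}
```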

17 CD: MCMC Sampling length
- Many cycles of MCMC sampling are still too costly
- Hinton's intuition: a few MCMC cycles ought to be enough
- After a few iterations the data has moved from the target distribution towards the proposed distribution
- Empirically, 1 cycle was found to suffice

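18 CD-1
To minimize the energy function, update

$$\Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^1} \right) \tag{15}$$

with step size $\eta$ (chosen experimentally).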
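The update (15) can be tried on a deliberately simple toy model. The sketch below assumes NumPy and the unnormalised model $f(x \mid \mu) = \exp(-(x - \mu)^2 / 2)$, so that $\partial \log f / \partial \mu = x - \mu$. A single Metropolis step stands in for the one MCMC cycle; that choice, the names `metropolis_step` and `cd1_fit_mean`, and all numeric settings are illustrative assumptions and not from the slides. In this toy case $Z$ is actually known, which makes the result easy to check.

```python
import numpy as np

def metropolis_step(x, mu, rng, step=1.0):
    """One Metropolis update per sample, targeting p(x | mu) proportional to exp(-(x - mu)^2 / 2)."""
    proposal = x + step * rng.standard_normal(x.shape)
    log_ratio = -0.5 * (proposal - mu) ** 2 + 0.5 * (x - mu) ** 2
    accept = np.log(rng.random(x.shape)) < log_ratio
    return np.where(accept, proposal, x)

def cd1_fit_mean(data, eta=0.5, n_iters=300, seed=0):
    """Fit mu with the CD-1 update (15), using one MCMC cycle per iteration."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for _ in range(n_iters):
        x0 = data                           # X^0: the training data
        x1 = metropolis_step(x0, mu, rng)   # X^1: data after one MCMC cycle
        grad_x0 = np.mean(x0 - mu)          # <d log f / d mu> over X^0
        grad_x1 = np.mean(x1 - mu)          # <d log f / d mu> over X^1
        mu = mu + eta * (grad_x0 - grad_x1)
    return mu

data = np.random.default_rng(1).normal(3.0, 1.0, size=2000)
mu_cd = cd1_fit_mean(data)
print(mu_cd, data.mean())
```

While `mu` is below the data mean, the Metropolis step drags the samples towards `mu`, so the two averages differ and `mu` is pushed upwards; the update stalls once one cycle no longer shifts the data on average.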

19 Simple example
- 15 "unigauss" experts
- Given the data, calculate the posterior of selecting the Gaussian or the uniform; calculate $\langle \cdot \rangle_{X^0}$
- Stochastically select the Gaussian or the uniform according to the posterior
- Compute the normalised product of the selected Gaussians and sample from it to get a reconstructed vector in data space
- Calculate $\langle \cdot \rangle_{X^1}$

20 Simple example 2
1. each dot is a data point
2. fitted with 15 unigauss experts
3. ellipses show σ

21 RBM with CD: digits
1. 500 hidden units
2. 16x16 visible units
3. pixel intensities in [0, 1]
4. ... examples
5. weights of 100 units shown
6. almost perfect reconstructions

22 Summary
- PoE may lead to a better approximation than mixture models
- The learning gradient is intractable to calculate; it is estimated via CD
- Similar to RBM learning

23 Thank you for your attention!
- Training Products of Experts by Minimizing Contrastive Divergence, Geoffrey E. Hinton, 2002
- Notes on Contrastive Divergence, Oliver Woodford

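24 Proof

$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{16}$$

$$= \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta}\, dx \tag{17}$$

$$= \frac{1}{Z(\Theta)} \int f(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{18}$$

$$= \int p(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{19}$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{20}$$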
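The identity between (16) and (20) can be checked numerically on a model where everything is computable. The sketch below assumes NumPy and the one-parameter model $f(x \mid \theta) = \exp(-\theta x^2 / 2)$, chosen here only for illustration; the function name `check_log_z_gradient` and the grid settings are hypothetical. It compares a finite-difference derivative of $\log Z(\theta)$ with the expectation of $\partial \log f / \partial \theta = -x^2 / 2$ under $p(x \mid \theta)$.

```python
import numpy as np

def check_log_z_gradient(theta, eps=1e-5):
    """Compare d log Z / d theta with <d log f / d theta> under p(x | theta)."""
    x = np.linspace(-40.0, 40.0, 400001)
    dx = x[1] - x[0]

    def log_z(t):
        # Riemann-sum estimate of log of the integral of exp(-t x^2 / 2).
        return np.log(np.sum(np.exp(-0.5 * t * x ** 2)) * dx)

    # Left-hand side: central finite difference of log Z.
    finite_diff = (log_z(theta + eps) - log_z(theta - eps)) / (2.0 * eps)

    # Right-hand side: expectation of d log f / d theta = -x^2 / 2 under p.
    f = np.exp(-0.5 * theta * x ** 2)
    p = f / (np.sum(f) * dx)
    expectation = np.sum(p * (-0.5 * x ** 2)) * dx
    return finite_diff, expectation

lhs, rhs = check_log_z_gradient(2.0)
print(lhs, rhs)
```

For this model $Z(\theta) = \sqrt{2\pi / \theta}$, so both sides should be close to $-1 / (2\theta)$.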
