Contrastive Divergence

Contrastive Divergence: Training Products of Experts by Minimizing CD (Hinton, 2002). Helmut Puhr, Institute for Theoretical Computer Science, TU Graz. June 9, 2010

Contents
1. Theory
2. Argument
3. Contrastive divergence
4. Applications
5. Summary

Definitions
Training data: $X = \{x_1, \ldots, x_K\}$. Model parameters: $\Theta$, where
$p(x \mid \Theta) = \frac{1}{Z(\Theta)} f(x \mid \Theta)$   (1)
$Z(\Theta) = \int f(x \mid \Theta)\,dx$   (2)

Estimating model parameters
Find $\Theta$ which maximises the probability of the training data,
$p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k \mid \Theta)$   (3)
or, equivalently, minimise the negative log-likelihood (denoted the energy function),
$E(X \mid \Theta) = -\tfrac{1}{K} \log p(X \mid \Theta) = \log Z(\Theta) - \tfrac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta)$   (4)

Model function: Gaussian
Choose the model function to be the PDF of a normal distribution, $f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma)$ (5), so that $\Theta = \{\mu, \sigma\}$.
Setting $\partial E(X \mid \Theta)/\partial \mu = 0$: the optimal $\mu$ is the mean of the training data.
Setting $\partial E(X \mid \Theta)/\partial \sigma = 0$: the optimal $\sigma^2$ is the variance of the training data.
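
As a quick numerical check (a minimal sketch, not part of the original slides; it assumes NumPy and uses synthetic data), the closed-form maximum-likelihood estimates for the single Gaussian are just the sample mean and the sample variance:

    import numpy as np

    # Hypothetical 1-D training data drawn from a known Gaussian (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=1.5, size=10_000)

    # Setting dE/dmu = 0 and dE/dsigma = 0 gives the sample mean and the
    # (biased) sample variance as the maximum-likelihood estimates.
    mu_ml = X.mean()
    var_ml = X.var()          # equivalent to ((X - mu_ml) ** 2).mean()

    print(f"mu_ml  ~ {mu_ml:.3f}  (true 2.0)")
    print(f"var_ml ~ {var_ml:.3f} (true {1.5 ** 2})")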

Model function: Mixture of Gaussians
Choose the model function to be a sum of $N$ normal distributions, so that $\Theta = \{\mu_1, \ldots, \mu_N, \sigma_1, \ldots, \sigma_N\}$:
$f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i)$   (6)
$\log Z(\Theta) = \log N$   (7)
$\partial E(X \mid \Theta)/\partial \Theta_i$ depends on the other parameters, so use expectation maximisation or gradient ascent, as sketched below.
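
Because the gradient with respect to each parameter depends on all the others, mixtures are usually fitted with EM in practice. A minimal sketch, assuming scikit-learn (which is not mentioned in the slides):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 1-D data from two well-separated Gaussians.
    rng = np.random.default_rng(1)
    X = np.concatenate([rng.normal(-3.0, 1.0, 500),
                        rng.normal(3.0, 0.5, 500)]).reshape(-1, 1)

    # EM fits means, variances and mixing weights jointly; in the slide's
    # formulation the sum of N normalised Gaussians integrates to N, so
    # Z(Theta) stays constant and the fit remains tractable.
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
    print("means:  ", gmm.means_.ravel())
    print("weights:", gmm.weights_)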

Model function: Product of Gaussians
Choose the model function to be a product of $N$ normal distributions:
$f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i)$   (8)
$Z(\Theta)$ is no longer constant, $Z(\Theta) = \int f(x \mid \Theta)\,dx$ is not tractable, and numerical integration inside $E(X \mid \Theta)$ is too costly.

Why use a Product of Gaussians?
Mixture models are very inefficient in high-dimensional spaces: the posterior distribution cannot be sharper than the individual models, yet the individual models have to be broadly tuned.
A plain product of Gaussians cannot approximate arbitrary smooth distributions, but it can if each individual model contains at least one latent variable; each expert then poses a constraint.
The drawback: the derivatives are hard to calculate.

Contrastive divergence
Minimise an energy function that cannot be evaluated directly: use CD to estimate the gradient of the energy function, and find a minimum by taking small steps in the direction of steepest descent.

CD: Energy function
$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta}$   (9)
$= \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_X$   (10)
where $\langle \cdot \rangle_X$ denotes the expectation of $\cdot$ given the data $X$.

CD: Derivation of $\partial \log Z(\Theta)/\partial \Theta$ (see Proof 1 at the end)
$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta}$   (11)
$= \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\,dx$   (12)
$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)}$   (13)

CD: Sampling
Approximate the derivative of $\log Z(\Theta)$ by drawing samples from $p(x \mid \Theta)$. These cannot be drawn directly ($Z(\Theta)$ is unknown), so use MCMC sampling instead (e.g. Gibbs sampling).

Gibbs Sampling
A special case of the Metropolis–Hastings algorithm: it is simpler to sample from a conditional distribution than to marginalise by integrating over a joint distribution.
E.g. to sample from $p(x, y)$, start with $y_0$ and, for $i = 1, 2, \ldots$:
$x_i \sim p(x \mid y = y_{i-1})$
$y_i \sim p(y \mid x = x_i)$
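
To make the alternating scheme concrete, here is a minimal sketch (NumPy only; the bivariate Gaussian target is an assumption chosen only because its conditionals are available in closed form):

    import numpy as np

    # Target: zero-mean, unit-variance bivariate Gaussian with correlation rho.
    # Its conditionals are p(x | y) = N(rho * y, 1 - rho^2), and symmetrically for y.
    rho = 0.8
    rng = np.random.default_rng(2)

    x, y = 0.0, 0.0                  # arbitrary starting point y_0
    samples = []
    for i in range(10_000):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # x_i ~ p(x | y = y_{i-1})
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # y_i ~ p(y | x = x_i)
        samples.append((x, y))

    samples = np.array(samples[1_000:])                  # discard burn-in
    print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # close to 0.8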

CD: MCMC Sampling
Use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution.
Given the data, all hidden states can be updated in parallel (conditional independence).

CD: Gibbs sampling
Time 0: the visible variables hold a data vector, and all hidden variables are updated with samples from their posterior distribution given the visible variables.
Time 1: update the visible variables to produce a reconstruction of the original data vector, then update all hidden variables again.

CD: Energy function 2
Let $X^n$ denote the training data transformed by $n$ cycles of MCMC, so that $X^0 \equiv X$. Substituting leads to
$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^\infty} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0}$   (14)

CD: MCMC Sampling length
Many cycles of MCMC sampling are still too costly. Hinton's intuition: a few MCMC cycles ought to be enough, since after a few iterations the data have already moved from the target towards the proposed distribution; empirically, one cycle suffices.

CD-1
To minimise the energy function, update
$\Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^1} \right)$   (15)
with step size $\eta$ (chosen experimentally).
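
Putting the pieces together, a minimal CD-1 training sketch for a Bernoulli RBM (NumPy only; the layer sizes, learning rate and random binary "data" are illustrative assumptions, not the experiment reported in the slides):

    import numpy as np

    rng = np.random.default_rng(3)
    n_visible, n_hidden, eta = 16 * 16, 64, 0.05

    # Illustrative binary data: random patterns stand in for real images.
    X = (rng.random((1_000, n_visible)) > 0.5).astype(float)

    W = 0.01 * rng.standard_normal((n_visible, n_hidden))   # weights (part of Theta)
    b_v = np.zeros(n_visible)                               # visible biases
    b_h = np.zeros(n_hidden)                                # hidden biases

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(5):
        for batch in np.array_split(X, 20):
            # Positive phase, <.>_{X^0}: hidden units driven by the data.
            p_h0 = sigmoid(batch @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

            # One Gibbs cycle: reconstruct the visibles, then recompute the hiddens.
            p_v1 = sigmoid(h0 @ W.T + b_v)
            p_h1 = sigmoid(p_v1 @ W + b_h)

            # Negative phase, <.>_{X^1}: CD-1 update Theta <- Theta + eta * (pos - neg).
            pos = batch.T @ p_h0
            neg = p_v1.T @ p_h1
            W += eta * (pos - neg) / len(batch)
            b_v += eta * (batch - p_v1).mean(axis=0)
            b_h += eta * (p_h0 - p_h1).mean(axis=0)

Using the hidden probabilities rather than binary samples in the negative phase is a common variance-reduction choice; both variants appear in practice.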

Simple example
15 "unigauss" experts (each a mixture of a Gaussian and a uniform). Given the data, calculate each expert's posterior probability of selecting the Gaussian or the uniform, and compute $\langle \cdot \rangle_{X^0}$. Stochastically select the Gaussian or the uniform according to this posterior, compute the normalised product of the selected Gaussians, sample from it to get a reconstructed vector in data space, and compute $\langle \cdot \rangle_{X^1}$.

Simple example 2 (figure): each dot is a data point; the data are fitted with 15 unigauss experts; the ellipses show the standard deviation $\sigma$.

RBM with CD: digits (figure)
1. 500 hidden units
2. 16x16 visible units
3. pixel intensities in [0, 1]
4. 8000 examples
5. weights of 100 units shown
6. almost perfect reconstructions

Summary
A PoE may lead to a better approximation than mixture models. The learning gradient is intractable to calculate exactly, so it is estimated via CD, similar to RBM learning.

Thank you for your attention!
References:
Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence", 2002.
Oliver Woodford, "Notes on Contrastive Divergence".

Proof 1
$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\,dx$   (16)
$= \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta}\,dx$   (17)
$= \frac{1}{Z(\Theta)} \int f(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\,dx$   (18)
$= \int p(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\,dx$   (19)
$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)}$   (20)