Learning Energy-Based Models of High-Dimensional Data

Geoffrey Hinton, Max Welling, Yee-Whye Teh, Simon Osindero
www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm

Discovering causal structure as a goal for unsupervised learning
- It is better to associate responses with the hidden causes than with the raw data.
- The hidden causes are useful for understanding the data.
- It would be interesting if real neurons really did represent independent hidden causes.

A different kind of hidden structure
- Instead of trying to find a set of independent hidden causes, try to find factors of a different kind.
- Capture structure by finding constraints that are Frequently Approximately Satisfied (FAS).
- Violations of FAS constraints reduce the probability of a data vector.
- If a constraint already has a big violation, violating it more does not make the data vector much worse (i.e. assume the distribution of violations is heavy-tailed).

Two types of density model
- Stochastic generative models using a directed acyclic graph (e.g. a Bayes net):

  p(d) = p(b) p(c | b) p(d | b, c)

  Synthesis is easy. Analysis can be hard. Learning is easy after analysis.
- Energy-based models that associate an energy with each data vector:

  p(d) = e^{-E(d)} / Σ_c e^{-E(c)}

  Synthesis is hard. Analysis is easy. Is learning hard?

Bayes Nets
[Figure: a directed graph with hidden-cause nodes above visible-effect nodes]
- It is easy to generate an unbiased example at the leaf nodes.
- It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
- Given samples from the posterior, it is easy to learn the local interactions.

Approximate inference
- What if we use an approximation to the posterior distribution over hidden configurations? e.g. assume the posterior factorizes into a product of distributions for each separate hidden cause.
- If we use the approximation for learning, there is no guarantee that learning will increase the probability that the model would generate the observed data.
- But maybe we can find a different and sensible objective function that is guaranteed to improve at each update.

A trade-off between how well the model fits the data and the tractability of inference
- New objective function (θ: parameters; d: data vectors; Q_d: approximating posterior; P_d: true posterior for data vector d):

  G(θ) = Σ_d [ log p(d | θ) - KL(Q_d || P_d) ]

  The first term measures how well the model fits the data; the KL term measures the inaccuracy of inference. (A short derivation of this decomposition appears after this slide.)
- This makes it feasible to fit very complicated models, but the approximations that are tractable may be very poor.

Energy-Based Models with deterministic hidden units
[Figure: a data vector feeding layers of deterministic hidden units whose activities contribute energies E_j and E_k]
- Use multiple layers of deterministic hidden units with non-linear activation functions.
- Hidden activities contribute additively to the global energy, E:

  p(d) = e^{-E(d)} / Σ_c e^{-E(c)}
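A short derivation, under standard variational assumptions (h ranges over hidden configurations), of why G(θ) lower-bounds the log likelihood and why the KL term measures the inaccuracy of inference; this is reconstructed from standard variational arguments rather than taken from the slides.

```latex
% For each data vector d, any approximating posterior Q_d over hidden configurations h satisfies
\[
  \log p(d \mid \theta)
  = \underbrace{\sum_h Q_d(h)\,\log\frac{p(d, h \mid \theta)}{Q_d(h)}}_{\mathcal{F}_d(\theta)\ \text{(tractable bound)}}
  \;+\; \mathrm{KL}\big(Q_d \,\|\, P_d\big),
  \qquad P_d = p(h \mid d, \theta).
\]
% Summing over data vectors and rearranging gives the objective on the slide,
% which is therefore a lower bound on the log likelihood:
\[
  G(\theta) = \sum_d \Big[\log p(d \mid \theta) - \mathrm{KL}\big(Q_d \,\|\, P_d\big)\Big]
            = \sum_d \mathcal{F}_d(\theta)
  \;\le\; \sum_d \log p(d \mid \theta).
\]
```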

Maximum likelihood learning is hard
- To get high log probability for d, we need low energy for d and high energy for its main rivals, c:

  log p(d) = -E(d) - log Σ_c e^{-E(c)}

  ∂ log p(d) / ∂θ = -∂E(d)/∂θ + Σ_c p(c) ∂E(c)/∂θ

- To sample from the model, use Markov Chain Monte Carlo.

Hybrid Monte Carlo
- The obvious Markov chain makes a random perturbation to the data and accepts it with a probability that depends on the energy change. It diffuses very slowly over flat regions and cannot cross energy barriers easily.
- In high-dimensional spaces, it is much better to use the gradient to choose good directions and to use momentum (a minimal sketch follows). This beats diffusion, scales well, and can cross energy barriers.
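A minimal sketch of the Hybrid (Hamiltonian) Monte Carlo idea on a toy quadratic energy; the energy function, step size, and number of leapfrog steps are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])        # toy anisotropic quadratic energy, standing in for E(d)

def energy(x):
    return 0.5 * x @ A @ x

def grad_energy(x):
    return A @ x

def hmc_step(x, eps=0.05, n_leapfrog=20):
    """One HMC transition: fresh random momentum, leapfrog dynamics, Metropolis accept."""
    p = rng.standard_normal(x.shape)            # random momentum
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * eps * grad_energy(x_new)     # half step for momentum
    for _ in range(n_leapfrog):
        x_new += eps * p_new                    # full step for position
        p_new -= eps * grad_energy(x_new)       # full step for momentum
    p_new += 0.5 * eps * grad_energy(x_new)     # correct the last momentum step to a half step
    # Accept or reject based on the change in total energy (potential + kinetic).
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    return x_new if rng.random() < np.exp(min(0.0, h_old - h_new)) else x

x = np.array([3.0, 3.0])
for _ in range(200):
    x = hmc_step(x)
print("sample after 200 HMC steps:", x)
```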

[Figure: HMC trajectories with different initial momenta on the same energy surface]

Backpropagation can compute the gradient that Hybrid Monte Carlo needs
1. Do a forward pass computing the hidden activities.
2. Do a backward pass all the way to the data to compute the derivative of the global energy w.r.t. each component of the data vector.
- This works with any smooth non-linearity (a minimal sketch follows).
[Figure: a data vector feeding hidden layers whose units contribute energies E_j and E_k]
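A minimal sketch of that forward/backward pass for a single layer of deterministic hidden units, assuming a Cauchy-like per-unit energy on the filter outputs; the filter matrix W, the sizes, and the energy shape are illustrative assumptions, not the model from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 48                            # data dimension and number of hidden units (illustrative)
W = 0.1 * rng.standard_normal((H, D))    # linear filters (the parameters)
b = np.zeros(H)

def energy(d):
    # Forward pass: filter outputs, then a heavy-tailed (Cauchy-like) energy per hidden unit,
    # summed to give the global energy E(d).
    s = W @ d + b
    return np.sum(np.log1p(s ** 2))

def grad_energy_wrt_data(d):
    # Backward pass: dE/ds = 2s / (1 + s^2), then chain back through the filters to the data.
    s = W @ d + b
    return W.T @ (2.0 * s / (1.0 + s ** 2))   # shape (D,): dE/dd for each data component

# Finite-difference check of one component of the gradient.
d = rng.standard_normal(D)
eps, i = 1e-5, 0
step = np.zeros(D); step[i] = eps
numeric = (energy(d + step) - energy(d - step)) / (2 * eps)
print(numeric, grad_energy_wrt_data(d)[i])
```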

The online HMC learning procedure
1. Start at a datavector, d, and use backprop to compute ∂E(d)/∂θ for every parameter.
2. Run HMC for many steps with frequent renewal of the momentum to get an equilibrium sample, c.
3. Use backprop to compute ∂E(c)/∂θ.
4. Update the parameters by:

   Δθ = ε ( -∂E(d)/∂θ + ∂E(c)/∂θ )

A surprising shortcut
- Instead of taking the negative samples from the equilibrium distribution, use slight corruptions of the datavectors. Only add random momentum once, and only follow the dynamics for a few steps.
- Much less variance, because a datavector and its confabulation form a matched pair.
- Seems to be very biased, but maybe it is optimizing a different objective function.
- If the model is perfect and there is an infinite amount of data, the confabulations will be equilibrium samples. So the shortcut will not cause learning to mess up a perfect model. (A minimal sketch of the shortcut update follows.)
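A minimal sketch of the shortcut update for a linear-filter energy model with a Cauchy-like per-unit energy (the same kind of toy model as in the earlier sketch); the filter matrix W, the dynamics, the step sizes, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 48
W = 0.1 * rng.standard_normal((H, D))    # linear filters: the parameters being learned

def grad_E_wrt_data(d):
    s = W @ d
    return W.T @ (2.0 * s / (1.0 + s ** 2))

def grad_E_wrt_W(d):
    # dE/dW for E(d) = sum_j log(1 + (w_j . d)^2)
    s = W @ d
    return np.outer(2.0 * s / (1.0 + s ** 2), d)

def confabulate(d, eps=0.05, n_steps=5):
    # Add random momentum once and follow the dynamics for only a few steps,
    # producing a slight corruption of the datavector.
    x, p = d.copy(), rng.standard_normal(d.shape)
    for _ in range(n_steps):
        p -= eps * grad_E_wrt_data(x)
        x += eps * p
    return x

def shortcut_update(batch, lr=1e-3):
    # Delta_theta = lr * ( -dE(d)/dtheta + dE(c)/dtheta ), averaged over the batch.
    global W
    dW = np.zeros_like(W)
    for d in batch:
        c = confabulate(d)
        dW += -grad_E_wrt_W(d) + grad_E_wrt_W(c)
    W += lr * dW / len(batch)

# Toy data: random walks, so "neighbouring components agree" constraints are
# frequently approximately satisfied.
batch = [np.cumsum(0.3 * rng.standard_normal(D)) for _ in range(100)]
for _ in range(50):
    shortcut_update(batch)
print("filter norms after training:", np.linalg.norm(W, axis=1)[:5])
```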

Intuitive motivation
- It is silly to run the Markov chain all the way to equilibrium if we can get the information required for learning in just a few steps.
- The way in which the model systematically distorts the data distribution in the first few steps tells us a lot about how the model is wrong.
- But the model could have strong modes far from any data. These modes will not be sampled by confabulations. Is this a problem in practice?

Contrastive divergence
- The aim is to minimize the amount by which a step toward equilibrium improves the data distribution. With Q^0 the data distribution, Q^1 the distribution after one step of the Markov chain, and Q^∞ the model's (equilibrium) distribution:

  CD = KL(Q^0 || Q^∞) - KL(Q^1 || Q^∞)

- Minimizing contrastive divergence minimizes the divergence between the data distribution and the model's distribution while maximizing the divergence between the confabulations and the model's distribution.

Contrastive divergence makes the awkward terms cancel

  ∂/∂θ KL(Q^0 || Q^∞) = <∂E/∂θ>_{Q^0} - <∂E/∂θ>_{Q^∞}

  ∂/∂θ KL(Q^1 || Q^∞) = <∂E/∂θ>_{Q^1} - <∂E/∂θ>_{Q^∞} + (∂Q^1/∂θ) (∂KL(Q^1 || Q^∞)/∂Q^1)

- Subtracting the two derivatives, the awkward expectations under Q^∞ cancel.
- The remaining extra term appears because changing the parameters changes the distribution of the confabulations; it is small and can safely be ignored.

Frequently Approximately Satisfied constraints
- The intensities in a typical image satisfy many different linear constraints very accurately, and violate a few constraints by a lot.
- The constraint violations fit a heavy-tailed distribution.
- The negative log probabilities of constraint violations can be used as energies (a small sketch contrasting the two energy shapes follows).
- On a smooth intensity patch the sides balance the middle.
[Figure: energy as a function of constraint violation, comparing a Gaussian (quadratic) energy with a Cauchy (heavy-tailed) energy]
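A small sketch contrasting a Gaussian (quadratic) energy with a Cauchy-like (heavy-tailed) energy of a constraint violation; the exact functional forms are illustrative assumptions.

```python
import numpy as np

def gauss_energy(v):
    # Negative log of a Gaussian (up to a constant): grows quadratically,
    # so a second big violation costs as much again.
    return 0.5 * v ** 2

def cauchy_energy(v):
    # Negative log of a Cauchy-like density (up to a constant): saturates,
    # so once a constraint is badly violated, violating it more costs little.
    return np.log1p(v ** 2)

for v in [0.1, 1.0, 3.0, 10.0]:
    print(f"violation {v:5.1f}: gauss {gauss_energy(v):7.2f}  cauchy {cauchy_energy(v):5.2f}")
```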

Learning constraints from natural images (Yee-Whye Teh)
- We used 16x16 image patches and a single layer of 768 hidden units (3x overcomplete).
- Confabulations are produced from data by adding random momentum once and simulating the dynamics for 30 steps.
- Weights are updated every 100 examples. A small amount of weight decay helps.
[Figure: a random subset of the 768 learned basis functions]

[Figure: the distribution of all 768 learned basis functions]

How to learn a topographic map
- The outputs of the linear filters are squared and locally pooled. This makes it cheaper to put filters that are violated at the same time next to each other (a minimal sketch of this pooled energy follows).
[Figure: the image drives linear filters with global connectivity; the squared filter outputs are pooled with local connectivity. Because the pooled energy is heavy-tailed, the cost of a second violation within a pool is smaller than the cost of the first.]
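A minimal sketch of a locally pooled, squared-filter energy of the kind described above; the grid layout, non-overlapping pools, and square-root non-linearity on each pooled value are illustrative assumptions, not the construction used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = 20                                  # filters laid out on a 20x20 topographic grid
D, POOL = 256, 2                           # 16x16 image patches, 2x2 local pools
W = 0.1 * rng.standard_normal((GRID * GRID, D))   # linear filters with global connectivity

def pooled_energy(x):
    # Filter outputs are squared and locally pooled on the grid; each pool then
    # contributes through a slowly growing non-linearity, so two violations that
    # land in the same pool cost less than two violations in different pools.
    s2 = ((W @ x) ** 2).reshape(GRID, GRID)
    energy = 0.0
    for i in range(0, GRID, POOL):
        for j in range(0, GRID, POOL):
            pooled = s2[i:i + POOL, j:j + POOL].sum()
            energy += np.sqrt(1e-3 + pooled)
    return energy

x = rng.standard_normal(D)
print("pooled energy of a random patch:", pooled_energy(x))
```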

Faster mixing chains
- Hybrid Monte Carlo can only take small steps because the energy surface is curved.
- With a single layer of hidden units, it is possible to use alternating parallel Gibbs sampling instead (a minimal sketch follows).
- Much less computation. Much faster mixing.
- Can be extended to use a pooled second layer (Max Welling).
- It can only be used in deep networks by learning one hidden layer at a time.
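A minimal sketch of alternating parallel Gibbs sampling in a binary restricted Boltzmann machine; the layer sizes and random weights are illustrative assumptions, not the model from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 32                          # visible and hidden layer sizes (illustrative)
W = 0.05 * rng.standard_normal((D, H))
b_v, b_h = np.zeros(D), np.zeros(H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    """One alternating step: sample all hidden units in parallel given the visibles,
    then sample all visible units in parallel given the hiddens."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(H) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    v_new = (rng.random(D) < p_v).astype(float)
    return v_new, h

v = (rng.random(D) < 0.5).astype(float)
for _ in range(100):
    v, h = gibbs_step(v)
print("mean visible activity after 100 alternating steps:", v.mean())
```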

Two views of Independent Components Analysis
- Stochastic causal generative models: the posterior distribution is intractable. In the ICA limit, the posterior collapses.
- Deterministic energy-based models: the partition function, Z, is intractable. In the ICA limit, Z becomes a determinant (a short worked form follows after this slide).
- When the number of linear hidden units equals the dimensionality of the data, the model has both marginal and conditional independence.

Density models
- Causal models:
  - Tractable posterior (mixture models, sparse Bayes nets, factor analysis): compute the exact posterior.
  - Intractable posterior (densely connected DAGs): Markov Chain Monte Carlo, or minimize the variational free energy.
- Energy-based models:
  - Stochastic hidden units: full Boltzmann machine (full MCMC); restricted Boltzmann machine (minimize contrastive divergence).
  - Deterministic hidden units: Markov Chain Monte Carlo, or fix the features ("maxent"), or minimize contrastive divergence.
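A short worked form of the "Z becomes a determinant" point, assuming the standard square, noiseless ICA setup (as many linear hidden units as data dimensions, filter rows w_i, source densities p_i); this is reconstructed from well-known ICA results rather than taken from the slides.

```latex
% Square, noiseless ICA: data x in R^n, filter matrix W in R^{n x n}, sources s = W x.
% Viewing it as an energy-based model with energy
%   E(x) = -\sum_i \log p_i(w_i^\top x),
% the change of variables s = W x gives a closed-form partition function:
\[
  p(x) \;=\; \frac{e^{-E(x)}}{Z}
        \;=\; |\det W| \,\prod_{i} p_i\!\left(w_i^\top x\right),
  \qquad\text{so}\qquad
  Z \;=\; \frac{1}{|\det W|}.
\]
% With as many linear hidden units as data dimensions, the normally intractable
% sum over all data vectors collapses to a determinant.
```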

Where to find out more
- www.cs.toronto.edu/~hinton has papers on:
  - Energy-Based ICA
  - Products of Experts
- This talk is at www.cs.toronto.edu/~hinton