Deep Boltzmann Machines


Deep Boltzmann Machines
Ruslan Salakhutdinov and Geoffrey E. Hinton
Presented by Amish Goel, University of Illinois Urbana-Champaign (agoel10@illinois.edu)
December 2, 2016

Overview
1. Introduction: Representation of the Model
2. Learning in Boltzmann Machines: Variational Lower Bound (Mean-Field Approximation); Stochastic Approximation Procedure (Persistent Markov Chains)
3. Additional Tricks for DBMs: Greedy Pretraining of the Model; Discriminative Finetuning
4. Simulation Results

Introduction
A Boltzmann machine is a pairwise Markov random field. Some of the random variables are treated as latent, i.e. hidden (h), and the rest as visible (v). For binary random variables the probability distribution is

P_\theta(v, h) = \frac{1}{Z_\theta} e^{-E_\theta(v, h)}, \qquad \theta = \{L, J, W\},
E_\theta(v, h) = -\tfrac{1}{2} v^T L v - \tfrac{1}{2} h^T J h - v^T W h.

Figure: Model for Boltzmann Machines
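As a concrete illustration of the energy function above, here is a minimal NumPy sketch (the names L, J, W, v, h mirror the slide's notation; the random initialization is purely illustrative) that evaluates E_\theta(v, h) and the unnormalized probability e^{-E_\theta(v, h)}:

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 5, 3

# Symmetric couplings with zero diagonal (no self-connections), matching the model above.
L = rng.normal(size=(n_vis, n_vis)); L = (L + L.T) / 2; np.fill_diagonal(L, 0.0)
J = rng.normal(size=(n_hid, n_hid)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
W = rng.normal(size=(n_vis, n_hid))

def energy(v, h, L, J, W):
    # E_theta(v, h) = -1/2 v^T L v - 1/2 h^T J h - v^T W h
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

v = rng.integers(0, 2, size=n_vis).astype(float)
h = rng.integers(0, 2, size=n_hid).astype(float)

E = energy(v, h, L, J, W)
print("energy:", E, "unnormalized probability:", np.exp(-E))  # Z_theta itself is intractable in general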

Representation
While the Boltzmann machine is a powerful model of the data, it is computationally expensive to learn, so several approximations to it are considered.
Figure: Boltzmann Machines vs RBM
A deep Boltzmann machine arranges the hidden nodes in several layers, where the units within a layer have no direct connections to each other.
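For concreteness, in a two-hidden-layer DBM only adjacent layers interact, so (dropping bias terms) the energy specializes to

E_\theta(v, h^{(1)}, h^{(2)}) = - v^T W^{(1)} h^{(1)} - (h^{(1)})^T W^{(2)} h^{(2)}, \qquad \theta = \{W^{(1)}, W^{(2)}\},

i.e. L = 0 and J couples only units in neighboring hidden layers.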

Learning in Boltzmann Machines
The model can be trained by maximum likelihood. The log-likelihood and its gradient take the following form:

\ln L_\theta(v) = \ln p_\theta(v) = \ln \sum_h p_\theta(v, h) = \ln \sum_h \exp(-E_\theta(v, h)) - \ln \sum_{v,h} \exp(-E_\theta(v, h))   (1)

\frac{\partial \ln L_\theta(v)}{\partial \theta} = -\sum_h p(h \mid v) \frac{\partial E_\theta(v, h)}{\partial \theta} + \sum_{v,h} p(v, h) \frac{\partial E_\theta(v, h)}{\partial \theta}

where the first term is the data-dependent expectation and the second is the model-dependent expectation.

Learning in Boltzmann Machines
Substituting E_\theta(v, h) into the gradient above and applying gradient ascent gives the parameter updates

\Delta W = \alpha\,(E_{P_{data}}[v h^T] - E_{P_{model}}[v h^T]),
\Delta L = \alpha\,(E_{P_{data}}[v v^T] - E_{P_{model}}[v v^T]),
\Delta J = \alpha\,(E_{P_{data}}[h h^T] - E_{P_{model}}[h h^T]),
\Delta b = \alpha\,(E_{P_{data}}[v] - E_{P_{model}}[v]),
\Delta c = \alpha\,(E_{P_{data}}[h] - E_{P_{model}}[h]).   (2)

These maximum-likelihood updates are very costly: computing both expectations requires summing over an exponential number of configurations, so approximations are needed.
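The updates in (2) only require expectations of pairwise statistics, so once data-side and model-side samples are available (from mean field and persistent chains, as described next), one update step is a few matrix products. A hedged NumPy sketch, with the sampling itself assumed to be given:

import numpy as np

def boltzmann_update(params, v_data, h_data, v_model, h_model, alpha=0.01):
    # One gradient-ascent step of Eq. (2): data expectation minus model expectation.
    # v_data, h_data: samples (or mean-field values) with the visibles clamped to data,
    #                 shapes (n, n_vis) and (n, n_hid)
    # v_model, h_model: samples from the model distribution, e.g. persistent Gibbs chains
    W, L, J, b, c = params
    W = W + alpha * (v_data.T @ h_data / len(v_data) - v_model.T @ h_model / len(v_model))
    L = L + alpha * (v_data.T @ v_data / len(v_data) - v_model.T @ v_model / len(v_model))
    J = J + alpha * (h_data.T @ h_data / len(h_data) - h_model.T @ h_model / len(h_model))
    b = b + alpha * (v_data.mean(0) - v_model.mean(0))
    c = c + alpha * (h_data.mean(0) - h_model.mean(0))
    np.fill_diagonal(L, 0.0); np.fill_diagonal(J, 0.0)  # keep zero self-connections
    return W, L, J, b, c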

Approximate Maximum Likelihood Learning in Boltzmann Machines
One approximation is to use a variational lower bound on the log-likelihood. By Jensen's inequality,

\ln p_\theta(v) = \ln \sum_h p_\theta(v, h) = \ln \sum_h q_\mu(h \mid v) \frac{p_\theta(v, h)}{q_\mu(h \mid v)} \geq \sum_h q_\mu(h \mid v) \ln p_\theta(v, h) + H_e(q_\mu) = \mathcal{L}(q_\mu, \theta)   (3)

where q_\mu(h \mid v) is an approximate (variational) posterior distribution and H_e(\cdot) is the entropy function with natural logarithm.
The aim is to find the tightest lower bound on the log-likelihood by optimizing over the distributions q_\mu and the parameters \theta.

Variational Learning for Boltzmann Machines
For Boltzmann machines, the lower bound can be rewritten (ignoring the bias terms) as

\mathcal{L}(q_\mu, \theta) = \sum_h q_\mu(h \mid v)\,(-E_\theta(v, h)) - \log Z_\theta + H_e(q_\mu)   (4)

Using the mean-field approximation q_\mu(h \mid v) = \prod_{j=1}^M q(h_j \mid v) with q(h_j = 1) = \mu_j, where M is the number of hidden units, this becomes

\mathcal{L}(q_\mu, \theta) = \sum_h \prod_{i=1}^M q_\mu(h_i \mid v) \left( \tfrac{1}{2} v^T L v + \tfrac{1}{2} h^T J h + v^T W h \right) - \log Z_\theta + H_e(q_\mu)
                         = \tfrac{1}{2} v^T L v + \tfrac{1}{2} \mu^T J \mu + v^T W \mu - \log Z_\theta + \sum_{j=1}^M H_e(\mu_j)   (5)
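As a sanity check on (5), the following sketch evaluates the mean-field bound up to the intractable -\log Z_\theta term (which does not depend on \mu); binary units, a zero-diagonal J, and NumPy are the assumptions here:

import numpy as np

def bernoulli_entropy(mu, eps=1e-12):
    # H_e(mu) for a Bernoulli(mu) variable, natural logarithm
    mu = np.clip(mu, eps, 1.0 - eps)
    return -(mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu))

def mean_field_bound_without_logZ(v, mu, L, J, W):
    # Eq. (5) without the log Z_theta term; assumes diag(J) = 0 so that
    # E_q[h^T J h] = mu^T J mu under the factorized posterior.
    return (0.5 * v @ L @ v + 0.5 * mu @ J @ mu + v @ W @ mu
            + bernoulli_entropy(mu).sum())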

Variational EM Learning for Boltzmann Machines
Maximize the lower bound by alternating maximization over the variational parameters \mu and the model parameters \theta (the usual EM idea).

E-step:  \sup_\mu \mathcal{L}(q_\mu, \theta) = \sup_\mu \tfrac{1}{2} v^T L v + \tfrac{1}{2} \mu^T J \mu + v^T W \mu - \log Z_\theta + \sum_{j=1}^M H_e(\mu_j)

Coordinate-wise maximization over each variational parameter gives the fixed-point update

\mu_j \leftarrow \sigma\Big( \sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj} \mu_m \Big),

where \sigma(\cdot) denotes the sigmoid function. Running these updates to convergence yields \hat{\mu}.
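A minimal sketch of this E-step fixed-point iteration, assuming binary units and NumPy; the fixed number of sweeps and the sequential update order are choices of this illustration, not prescribed by the slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_estep(v, W, J, mu0, n_sweeps=30):
    # Iterate mu_j <- sigmoid(sum_i W_ij v_i + sum_{m != j} J_mj mu_m) to (approximate) convergence.
    mu = mu0.copy()
    for _ in range(n_sweeps):
        for j in range(len(mu)):
            mu[j] = sigmoid(v @ W[:, j] + np.delete(J[:, j], j) @ np.delete(mu, j))
    return mu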

Stochastic Approximations or Persistent Markov Chains
M-step:  \sup_\theta \mathcal{L}(q_\mu, \theta) = \sup_\theta \tfrac{1}{2} v^T L v + \tfrac{1}{2} \mu^T J \mu + v^T W \mu - \log Z_\theta + \sum_{j=1}^M H_e(\mu_j)

MCMC sampling with persistent Markov chains is used to approximate the gradient of the log-partition function \log Z_\theta.
The parameter updates for one training example can then be written as

\Delta W = \alpha_t \Big( v \hat{\mu}^T - \tfrac{1}{N} \sum_{i=1}^{N} \tilde{v}_i \tilde{h}_i^T \Big),
\Delta L = \alpha_t \Big( v v^T - \tfrac{1}{N} \sum_{i=1}^{N} \tilde{v}_i \tilde{v}_i^T \Big),
\Delta J = \alpha_t \Big( \hat{\mu} \hat{\mu}^T - \tfrac{1}{N} \sum_{i=1}^{N} \tilde{h}_i \tilde{h}_i^T \Big),   (6)

where (\tilde{v}_i, \tilde{h}_i) are the current samples of the persistent Markov chains.

Overall Algorithm for Training Boltzmann Machines
Data: a training set S_N of N binary data vectors v, and M, the number of persistent Markov chains.
Initialize the parameter vector \theta^0 and M samples {\tilde{v}^{0,1}, \tilde{h}^{0,1}}, ..., {\tilde{v}^{0,M}, \tilde{h}^{0,M}}.
for t = 0 to T (number of iterations) do
    for each training example v^n in S_N do
        Randomly initialize \mu^n and run the mean-field updates \mu_j \leftarrow \sigma(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj} \mu_m) to obtain \hat{\mu}^n.
    end for
    for m = 1 to M (number of persistent Markov chains) do
        Sample (\tilde{v}^{t+1,m}, \tilde{h}^{t+1,m}) given (\tilde{v}^{t,m}, \tilde{h}^{t,m}) by running the Gibbs sampler.
    end for
    Update \theta using equation (6) (adjusted for batch data) and decrease the learning rate \alpha_t.
end for
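A hedged, end-to-end NumPy sketch of this procedure for a small fully connected binary Boltzmann machine; the single-example updates, the number of mean-field sweeps, the simple learning-rate decay, and the omission of bias terms are simplifications of this illustration:

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_sweeps=20):
    # E-step: fixed-point updates for mu; diag(J) = 0 makes the m != j restriction automatic.
    mu = np.full(W.shape[1], 0.5)
    for _ in range(n_sweeps):
        mu = sigmoid(v @ W + J @ mu)
    return mu

def gibbs_sweep(v, h, L, J, W):
    # One sweep of a persistent Gibbs chain over all visible and hidden units.
    for i in range(len(v)):
        v[i] = float(rng.random() < sigmoid(L[i] @ v + W[i] @ h))
    for j in range(len(h)):
        h[j] = float(rng.random() < sigmoid(J[j] @ h + v @ W[:, j]))
    return v, h

def train_bm(data, n_hid, n_chains=10, n_epochs=5, alpha=0.05):
    n_vis = data.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hid))
    L = np.zeros((n_vis, n_vis)); J = np.zeros((n_hid, n_hid))
    chains_v = rng.integers(0, 2, size=(n_chains, n_vis)).astype(float)
    chains_h = rng.integers(0, 2, size=(n_chains, n_hid)).astype(float)
    for epoch in range(n_epochs):
        for v in data:
            mu = mean_field(v, W, J)                                   # E-step
            for m in range(n_chains):                                  # advance persistent chains
                chains_v[m], chains_h[m] = gibbs_sweep(chains_v[m], chains_h[m], L, J, W)
            W += alpha * (np.outer(v, mu)  - chains_v.T @ chains_h / n_chains)   # Eq. (6)
            L += alpha * (np.outer(v, v)   - chains_v.T @ chains_v / n_chains)
            J += alpha * (np.outer(mu, mu) - chains_h.T @ chains_h / n_chains)
            np.fill_diagonal(L, 0.0); np.fill_diagonal(J, 0.0)
        alpha *= 0.9                                                   # decrease the learning rate
    return W, L, J

# usage on toy binary data
data = (rng.random((50, 6)) < 0.5).astype(float)
W, L, J = train_bm(data, n_hid=4)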

Learning for Deep Boltzmann Machines
For deep Boltzmann machines, L = 0 and J has many zero blocks, because hidden-unit interactions exist only between adjacent layers. This simplifies some of the computations; in particular, the Gibbs sampling procedure is simpler because all units within one layer can be sampled in parallel.
However, learning is still observed to be slow, and greedy pretraining can lead to faster convergence of the parameters.
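To illustrate the layer-parallel sampling, here is a hedged sketch of one block-Gibbs sweep for a two-hidden-layer DBM (v -- W1 -- h1 -- W2 -- h2); the weight names W1, W2 and the random initialization are assumptions of this sketch, and bias terms are omitted:

import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h1, h2, W1, W2):
    # No within-layer connections, so every unit in a layer is conditionally
    # independent given the neighboring layers and the whole layer is sampled at once.
    h1 = bernoulli(sigmoid(v @ W1 + h2 @ W2.T))   # p(h1 | v, h2)
    v  = bernoulli(sigmoid(h1 @ W1.T))            # p(v | h1)
    h2 = bernoulli(sigmoid(h1 @ W2))              # p(h2 | h1)
    return v, h1, h2

# tiny usage example with random weights and states
n_v, n_h1, n_h2 = 6, 4, 3
W1 = 0.1 * rng.normal(size=(n_v, n_h1)); W2 = 0.1 * rng.normal(size=(n_h1, n_h2))
v, h1, h2 = (bernoulli(np.full(n, 0.5)) for n in (n_v, n_h1, n_h2))
v, h1, h2 = dbm_gibbs_sweep(v, h1, h2, W1, W2)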

Pretraining in Deep Boltzmann Machines
Each layer is trained as a separate RBM, with some weight scaling so that, when the RBMs are composed into a DBM, the input each layer receives from the layers above and below is not double-counted.
Figure: Greedy Layerwise Pretraining for DBM
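A minimal sketch of greedy layerwise pretraining with CD-1, assuming NumPy and binary data; it trains each RBM on the hidden probabilities of the layer below and deliberately omits the paper's exact weight-doubling/halving scheme as well as bias terms:

import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm_cd1(data, n_hid, n_epochs=5, alpha=0.05):
    # Plain contrastive-divergence (CD-1) training of one RBM.
    W = 0.01 * rng.normal(size=(data.shape[1], n_hid))
    for _ in range(n_epochs):
        for v0 in data:
            p_h0 = sigmoid(v0 @ W); h0 = bernoulli(p_h0)
            v1 = bernoulli(sigmoid(h0 @ W.T))            # one Gibbs step
            p_h1 = sigmoid(v1 @ W)
            W += alpha * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W

def greedy_pretrain(data, layer_sizes):
    # Stack RBMs: each one is trained on the hidden probabilities of the layer below.
    weights, layer_input = [], data
    for n_hid in layer_sizes:
        W = train_rbm_cd1(layer_input, n_hid)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W)
    return weights

# usage: pretrain a two-hidden-layer stack on toy binary data
data = bernoulli(np.full((100, 6), 0.5))
W1, W2 = greedy_pretrain(data, [4, 3])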

Discriminative Finetuning in Deep Boltzmann Machines
An additional finetuning step is used to further improve performance. For example, for a two-hidden-layer DBM, the approximate posterior over the hidden units is used as an augmented input to a feedforward neural network whose weights are initialized from the parameters of the DBM.
Figure: Finetuning the parameters of the DBM
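A hedged sketch of this idea for a two-hidden-layer DBM: the mean-field posterior over the top hidden layer augments the input of a feedforward classifier whose first layer reuses the DBM weights; the classifier weights W_out, the softmax output, and the (omitted) backpropagation training loop are assumptions of this illustration:

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field_posterior(v, W1, W2, n_sweeps=20):
    # Approximate q(h1), q(h2) for a two-layer DBM by mean-field updates.
    mu1 = np.full(W1.shape[1], 0.5); mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_sweeps):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

def finetune_forward(v, W1, W2, W_out):
    # Forward pass of the discriminative network: the input is effectively [v, q(h2)],
    # and the first hidden layer is initialized from the DBM weights W1, W2.
    _, mu2 = mean_field_posterior(v, W1, W2)
    h = sigmoid(v @ W1 + mu2 @ W2.T)
    scores = h @ W_out                               # hypothetical classification layer
    p = np.exp(scores - scores.max())
    return p / p.sum()                               # class probabilities (softmax)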

Some Experimental Results and Observations
A DBM was trained to model handwritten digits from the MNIST dataset.
Figure: An example of a DBM used for MNIST data generation, trained on 60,000 examples. (a) DBM model used for training; (b) examples of handwritten digits.
Some interesting observations:
Without greedy pretraining, the models did not produce good results.
With discriminative finetuning, the DBM gave 99.5% accuracy, the best recognition result on the MNIST dataset at the time.

Thank You