Beyond the Restricted Boltzmann Machine (Au-delà de la Machine de Boltzmann Restreinte). Hugo Larochelle, University of Toronto

Introduction
- Restricted Boltzmann Machines (RBMs) are useful feature extractors.
- They are mostly used to initialize deep feed-forward neural networks.
- Can the Boltzmann machine modeling framework be useful on its own?

Restricted Boltzmann Machine
Figure: hidden units h (binary) with biases b_j, input units x (binary) with biases c_k, and connections W between them.
Energy function: E(x, h) = -h^T W x - c^T x - b^T h = -\sum_{jk} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j
Distribution: p(x, h) = exp(-E(x, h)) / Z, where the partition function Z is intractable in general.
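
The energy above translates directly into code. A minimal NumPy sketch (not from the slides; the array names W, b, c and the vector shapes are assumptions matching the notation above):

```python
import numpy as np

def rbm_energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T x - b^T h, with W of shape (H, D),
    h a binary vector of length H and x a binary vector of length D."""
    return -h @ W @ x - c @ x - b @ h

def unnormalized_p(x, h, W, b, c):
    # p(x, h) up to the partition function Z, which is intractable in general.
    return np.exp(-rbm_energy(x, h, W, b, c))
```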

Inference
p(h|x) = \prod_j p(h_j|x), with p(h_j = 1|x) = 1 / (1 + exp(-(b_j + W_{j·} x))) = sigm(b_j + W_{j·} x)
p(x|h) = \prod_k p(x_k|h), with p(x_k = 1|h) = 1 / (1 + exp(-(c_k + h^T W_{·k}))) = sigm(c_k + h^T W_{·k})
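
As a sketch of how these conditionals are used in practice (assumed helper names, NumPy, not from the slides):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, b):
    # p(h_j = 1 | x) = sigm(b_j + W_{j.} x), computed for all j at once
    return sigm(b + W @ x)

def p_x_given_h(h, W, c):
    # p(x_k = 1 | h) = sigm(c_k + h^T W_{.k}), computed for all k at once
    return sigm(c + W.T @ h)

def gibbs_step(x, W, b, c, rng):
    # One alternating Gibbs step: sample h ~ p(h|x), then x' ~ p(x|h).
    h = (rng.random(b.shape) < p_h_given_x(x, W, b)).astype(float)
    x_new = (rng.random(c.shape) < p_x_given_h(h, W, c)).astype(float)
    return x_new, h
```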

Inference (free energy)
p(x) = \sum_{h \in \{0,1\}^H} p(x, h) = \sum_{h \in \{0,1\}^H} exp(-E(x, h)) / Z = exp(-F(x)) / Z
where the free energy is F(x) = -c^T x - \sum_{j=1}^{H} log(1 + exp(b_j + W_{j·} x))
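
The free energy has a simple closed form that is worth computing in a numerically stable way; a sketch under the same assumed names as above:

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -c^T x - sum_j log(1 + exp(b_j + W_{j.} x)), so that p(x) = exp(-F(x)) / Z."""
    # np.logaddexp(0, a) is a stable log(1 + exp(a)) (softplus)
    return -c @ x - np.sum(np.logaddexp(0.0, b + W @ x))
```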

Inference: in general
- x_unk: unknown variables, whose distribution we are interested in
- x_cond: known variables, on which we condition
- h_marg: unknown variables, over which we marginalize
p(x_unk | x_cond) = \sum_{h_marg} exp(-E(x_unk, x_cond, h_marg)) / \sum_{x_unk} \sum_{h_marg} exp(-E(x_unk, x_cond, h_marg))

Learning
To train an RBM, we minimize the negative log-likelihood (NLL) of the data:
L_gen(D_train) = (1/n) \sum_{x_t \in D_train} -log p(x_t)
We proceed by stochastic gradient descent:
∂(-log p(x_t))/∂θ = E_h[∂E(x_t, h)/∂θ | x_t]   (positive phase)   - E_{x,h}[∂E(x, h)/∂θ]   (negative phase)

Contrastive Divergence (CD) (Hinton, Neural Computation, 2002)
Idea:
1. replace the expectation by an evaluation at a single point (x̃, h̃)
2. obtain (x̃, h̃) by MCMC sampling
3. start the sampling chain at x_t
Gibbs chain (diagram): x_t → h̃_t = h̃^0 → x̃^1 → ... → x̃^k = x̃, h̃^k = h̃, alternating samples from p(h|x) and p(x|h).

Contrastive Divergence (CD) (Hinton, Neural Computation, 2002)
Positive phase: E_h[∂E(x_t, h)/∂θ | x_t] ≈ ∂E(x_t, h̃_t)/∂θ
Negative phase: E_{x,h}[∂E(x, h)/∂θ] ≈ ∂E(x̃, h̃)/∂θ
i.e. the expectations over E(x, h) are evaluated at the single points (x_t, h̃_t) and (x̃, h̃).

Contrastive Divergence (CD) (Hinton, Neural Computation, 2002)
- CD-k: contrastive divergence with k Gibbs sampling steps.
- In general, the bigger k is, the less biased the estimate of the gradient.
- In practice, k = 1 works well.
- For an RBM, the parameter updates won't actually need samples for h, as illustrated in the sketch below.
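
Below is a minimal CD-k update for a single training example, reusing the helpers sketched earlier (p_h_given_x, p_x_given_h); this is an illustrative sketch of the standard update rule, not the author's code. Note how the positive and negative statistics use p(h|x) probabilities rather than samples of h:

```python
import numpy as np

def cd_k_update(x_t, W, b, c, lr=0.01, k=1, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    ph_pos = p_h_given_x(x_t, W, b)           # positive phase statistics (no h samples needed)
    x_neg = x_t.copy()
    for _ in range(k):                        # k Gibbs steps started at the training example x_t
        h_sample = (rng.random(b.shape) < p_h_given_x(x_neg, W, b)).astype(float)
        x_neg = (rng.random(c.shape) < p_x_given_h(h_sample, W, c)).astype(float)
    ph_neg = p_h_given_x(x_neg, W, b)         # negative phase statistics at the chain's end
    W += lr * (np.outer(ph_pos, x_t) - np.outer(ph_neg, x_neg))
    b += lr * (ph_pos - ph_neg)
    c += lr * (x_t - x_neg)
    return W, b, c
```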

Other learning algorithm: Persistent CD (PCD) (Tieleman, ICML 2008)
Idea: instead of always reinitializing the Gibbs chain at x_t, use the state x̃ from the previous iteration instead.
Gibbs chain (diagram): x̃ (from the previous iteration) → h̃^0 → x̃^1 → ... → x̃^k = x̃, h̃^k = h̃, alternating samples from p(h|x) and p(x|h).
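
The only change relative to CD-k is where the negative chain starts; a sketch (same assumed helpers as above), with the chain state carried from one update to the next:

```python
import numpy as np

def pcd_update(x_t, x_chain, W, b, c, lr=0.01, k=1, rng=None):
    """Like cd_k_update, but the Gibbs chain continues from x_chain (its state
    at the previous update) instead of being reinitialized at x_t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ph_pos = p_h_given_x(x_t, W, b)
    x_neg = x_chain
    for _ in range(k):
        h_sample = (rng.random(b.shape) < p_h_given_x(x_neg, W, b)).astype(float)
        x_neg = (rng.random(c.shape) < p_x_given_h(h_sample, W, c)).astype(float)
    ph_neg = p_h_given_x(x_neg, W, b)
    W += lr * (np.outer(ph_pos, x_t) - np.outer(ph_neg, x_neg))
    b += lr * (ph_pos - ph_neg)
    c += lr * (x_t - x_neg)
    return W, b, c, x_neg                     # x_neg is the persistent state for the next update
```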

Plan
- RBMs can be used to initialize deep neural networks, but the result is then no longer a Boltzmann machine.
- Three extensions to RBMs: the Classification RBM, the Restricted Boltzmann Forest, and the Deep Boltzmann Machine (DBM).

Classification Restricted Boltzmann Machine
Figure: hidden units h (binary) with biases b_j, input units x (binary) with biases c_k, target units y (multinomial, one-hot, e.g. 0 0 1 0) with biases a_l, connections W between x and h, and connections U between y and h.
Energy function: E(x, h, y) = -h^T W x - c^T x - b^T h - h^T U y - a^T y = -\sum_{jk} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j - \sum_{jl} U_{jl} h_j y_l - \sum_l a_l y_l
Distribution: p(x, h, y) = exp(-E(x, h, y)) / Z
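
The extended energy is a direct extension of the RBM energy sketch given earlier (assumed names; y_onehot is the one-hot encoding of the target):

```python
def clf_rbm_energy(x, h, y_onehot, W, U, b, c, a):
    """E(x, h, y) = -h^T W x - c^T x - b^T h - h^T U y - a^T y,
    with W of shape (H, D) and U of shape (H, C)."""
    return (-h @ W @ x - c @ x - b @ h
            - h @ U @ y_onehot - a @ y_onehot)
```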


Discriminative training (DRBM) (Larochelle and Bengio, ICML 2008)
What if we are only interested in classification? We can train on the conditional likelihood of the targets:
L_disc(D_train) = \sum_{t=1}^{|D_train|} -log p(y_t | x_t)
Since p(y|x) can be computed exactly, we can also compute the discriminative gradient exactly:
∂(-log p(y_t|x_t))/∂θ = ∂F(y_t, x_t)/∂θ - E_{y|x_t}[∂F(y, x_t)/∂θ]
where F(y, x) = -a_y - \sum_{j=1}^{H} log(1 + exp(b_j + U_{jy} + \sum_{i=1}^{d} W_{ji} x_i))
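
Summing out the hidden units gives this closed form for p(y|x), which is what makes exact discriminative training possible; a NumPy sketch (assumed names, following the notation above):

```python
import numpy as np

def p_y_given_x(x, W, U, b, a):
    """p(y|x) proportional to exp(-F(y, x)), with
    -F(y, x) = a_y + sum_j log(1 + exp(b_j + U_{jy} + W_{j.} x))."""
    neg_F = a + np.sum(np.logaddexp(0.0, b[:, None] + U + (W @ x)[:, None]), axis=0)
    neg_F -= neg_F.max()              # numerical stability before exponentiating
    p = np.exp(neg_F)
    return p / p.sum()
```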

Hybrid generative/discriminative training (HDRBM) (Larochelle and Bengio, ICML 2008)
We can explore the space between generative and discriminative training with a weight α (Bouchard & Triggs, 2004):
L_hybrid(D_train) = L_disc(D_train) + α L_gen(D_train)
- Perform two updates, one according to each criterion.
- The generative criterion acts as a data-dependent regulariser.
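
In practice this amounts to combining two gradient steps per example; a schematic sketch, where grad_disc and grad_gen are hypothetical helpers returning the gradients of -log p(y_t|x_t) and of the generative criterion:

```python
def hybrid_update(theta, x_t, y_t, lr, alpha):
    # grad_disc and grad_gen are hypothetical helpers; alpha weighs the
    # generative (data-dependent regularizing) criterion.
    theta = theta - lr * grad_disc(theta, x_t, y_t)         # discriminative update
    theta = theta - lr * alpha * grad_gen(theta, x_t, y_t)  # generative update
    return theta
```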

Experiment: MNIST

Model                                         Error
RBM (λ = 0.005, n = 6000)                     3.39%
DRBM (λ = 0.05, n = 500)                      1.81%
RBM+NNet                                      1.41%
HDRBM (α = 0.01, λ = 0.05, n = 1500)          1.28%
Sparse HDRBM (idem + n = 3000, δ = 10^-4)     1.16%
SVM                                           1.40%
NNet                                          1.93%

Figure: subset of filters learned by the HDRBM on the MNIST dataset. Some filters act as edge detectors (first row), while other filters are more specific to a particular digit shape (second row).

Experiment: 20 Newsgroups

Model                                     Error
RBM (λ = 0.0005, n = 1000)                24.9%
DRBM (λ = 0.0005, n = 50)                 27.6%
RBM+NNet                                  26.8%
HDRBM (α = 0.005, λ = 0.1, n = 1000)      23.8%
SVM                                       32.8%
NNet                                      28.2%

Newsgroup                   Most influential words
comp.sys.ibm.pc.hardware    dos, ide, adaptec, pc, config, irq, vlb, bios, scsi, esdi
comp.sys.mac.hardware       apple, mac, quadra, powerbook, lc, pds, centris
alt.atheism                 bible, atheists, benedikt, atheism, religion
talk.politics.guns          firearms, handgun, firearm, gun, rkba, concealed
rec.sport.hockey            playoffs, penguins, didn, playoff, game, out, play

Restricted Boltzmann Forest (Larochelle et al., Neural Computation, to appear)
How to obtain a tractable density estimator from an RBM? Answer: constrain the RBM such that we can compute Z!
Z = \sum_{h \in \{0,1\}^H} \sum_{x \in \{0,1\}^d} exp(-E(x, h)) = \sum_{h \in \{0,1\}^H} exp(-F(h))
This sum is tractable only if the number of hidden units is small.
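
Why must the number of hidden units be small? The second form of Z still sums over all 2^H hidden configurations; a brute-force sketch (assumed names, using the free energy of h obtained by summing out x):

```python
import itertools
import numpy as np

def partition_function_small_H(W, b, c):
    """Z = sum_h exp(-F(h)), with F(h) = -b^T h - sum_k log(1 + exp(c_k + h^T W_{.k})).
    Feasible only for a small number of hidden units H (2^H terms)."""
    H = b.shape[0]
    log_terms = []
    for bits in itertools.product([0.0, 1.0], repeat=H):
        h = np.array(bits)
        log_terms.append(b @ h + np.sum(np.logaddexp(0.0, c + W.T @ h)))
    m = max(log_terms)                      # log-sum-exp for numerical stability
    return np.exp(m) * np.sum(np.exp(np.array(log_terms) - m))
```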

Restricted Boltzmann Forests (Larochelle et al., Neural Computation, to appear)
Alternative: constrain the hidden activations using a tree.
Figure: hidden units h connected to inputs x through W, with a tree structure over the hidden units.

Restricted Boltzmann Forests (Larochelle et al., Neural Computation, to appear)
Conditioned on x, the distribution of each tree is independent. Using belief propagation (sum-product), we can:
- sample the hidden units
- compute unit expectations (probabilities)
- sum out all the hidden units

Experiments

Dataset      Nb. inputs   Set sizes (Train, Valid, Test)   Diff. NLL (RBM, RBForest, mult. RBForest)
adult        123          5000, 1414, 26147                4.18 ± 0.11, 4.15, 4.12 ± 0.11, 4.12 ± 0.
connect-4    126          16000, 4000, 47557               0.75 ± 0.04, -1.72, 0.59 ± 0.04, 0.05, 0.59 ± 0.
dna          180          1400, 600, 1186                  1.29 ± 0.72, 1.45, 1.39 ± 0.73, 0.67, 1.39 ± 0.
mushrooms    112          2000, 500, 5624                  -0.69 ± 0.13, 0.042, -0.69 ± 0.12, 0.11, 0.042 ± 0.
nips-0-12    500          400, 100, 1240                   12.65 ± 1.55, 12.61, 11.25 ± 1.55, 12.61 ± 1.
ocr-letter   128          32152, 10000, 10000              -2.49 ± 0.44, 0.99, 3.78 ± 0.43, 3.78 ± 0.
rcv1         150          40000, 10000, 150000             -1.29 ± 0.16, -0.044, 0.56 ± 0.16, 0.56 ± 0.
web          300          14000, 3188, 32561               0.78 ± 0.30, 0.018, -0.15 ± 0.31, -0.15 ± 0.

Deep Boltzmann Machines (DBM) (Salakhutdinov and Hinton, AISTATS 2009)
Figure: a Deep Belief Network and a Deep Boltzmann Machine, each with layers v, h^1, h^2, h^3 connected by weights W^1, W^2, W^3.
Energy function: E(v, h; θ) = -v^T W^1 h^1 - (h^1)^T W^2 h^2 - (h^2)^T W^3 h^3

Learning
Learning under a DBM requires a similar gradient:
∂(-log P(v_t))/∂θ = E_h[∂E(v_t, h; θ)/∂θ | v_t]   (positive phase)   - E_{v,h}[∂E(v, h; θ)/∂θ]   (negative phase)
- We use Persistent CD for the negative phase.
- The positive phase now must be approximated.

Learning
In the positive phase, a variational (mean-field) approach is used:
Q_MF(h|v; µ) = \prod_{j=1}^{F_1} \prod_{k=1}^{F_2} \prod_{m=1}^{F_3} q(h^1_j) q(h^2_k) q(h^3_m), where q(h^l_i = 1) = µ^l_i for l = 1, 2, 3.
The procedure minimizes KL(Q_MF(h|v; µ) || P(h|v)), which yields a lower bound on the log-probability of the data, by iterating the fixed-point updates:
µ^1_j ← σ(\sum_{i=1}^{D} W^1_{ij} v_i + \sum_{k=1}^{F_2} W^2_{jk} µ^2_k)
µ^2_k ← σ(\sum_{j=1}^{F_1} W^2_{jk} µ^1_j + \sum_{m=1}^{F_3} W^3_{km} µ^3_m)
µ^3_m ← σ(\sum_{k=1}^{F_2} W^3_{km} µ^2_k)
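
The three fixed-point equations translate directly into an iterative update loop; a sketch for a 3-hidden-layer DBM (biases omitted, as in the equations above; W1, W2, W3 of shapes D x F1, F1 x F2, F2 x F3 are assumptions):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, W3, n_steps=25):
    mu1 = sigm(W1.T @ v)                  # crude bottom-up initialization
    mu2 = sigm(W2.T @ mu1)
    mu3 = sigm(W3.T @ mu2)
    for _ in range(n_steps):
        mu1 = sigm(W1.T @ v + W2 @ mu2)   # mu1_j = sigm(sum_i W1_ij v_i + sum_k W2_jk mu2_k)
        mu2 = sigm(W2.T @ mu1 + W3 @ mu3) # mu2_k = sigm(sum_j W2_jk mu1_j + sum_m W3_km mu3_m)
        mu3 = sigm(W3.T @ mu2)            # mu3_m = sigm(sum_k W3_km mu2_k)
    return mu1, mu2, mu3
```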

Pretraining
Figure: each pair of layers is pretrained as an RBM (v-h^1 with W^1, h^1-h^2 with W^2, h^2-h^3 with W^3); the stack of RBMs is then assembled into a Deep Boltzmann Machine with weights W^1, W^2, W^3.

Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS 2010)
- Unfortunately, the learning procedure described so far is quite slow.
- The major bottleneck is variational inference, which requires around 25 passes up and down the network before converging.

Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS 2010)
Idea: suppose that we have a relatively good mean-field recognition net; we could then initialize the mean-field with the recognition net's output.
Figure: Deep Boltzmann Machine with weights W^1, W^2, W^3, alongside a recognition net with weights 2R^1, 2R^2, R^3.

Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS 2010)
Pseudocode:
1. get a prediction ν from the recognition net
2. initialize the mean-field to µ = ν and do only a few steps of mean-field (positive phase)
3. do Persistent CD (negative phase)
4. update the DBM parameters
5. update the recognition net parameters
6. go back to 1 until converged
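
A schematic version of one training step, to make the control flow concrete; dbm, rec_net and their methods are hypothetical objects standing in for the pieces described above, not the authors' code:

```python
def train_step(v_t, dbm, rec_net, chain_state, lr, mf_steps=3):
    nu = rec_net.predict(v_t)                            # 1. prediction from the recognition net
    mu = dbm.mean_field(v_t, init=nu, n_steps=mf_steps)  # 2. a few mean-field steps from mu = nu
    chain_state = dbm.gibbs_step(chain_state)            # 3. Persistent CD negative phase
    dbm.update_parameters(v_t, mu, chain_state, lr)      # 4. update DBM parameters
    rec_net.update_parameters(v_t, mu, lr)               # 5. train recognition net to predict mu
    return chain_state                                   # 6. repeat until converged
```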

Experiments: classification error

Dataset       MF-0 (no top-down)   MF-1    MFRec-1   MF-5    MFRec-5   MF-Full   DBN     SVM      K-NN
MNIST         1.38%                1.15%   1.00%     1.01%   0.96%     0.95%     1.17%   1.40%    3.09%
OCR letters   8.68%                8.44%   8.40%     8.50%   8.48%     8.58%     9.68%   9.70%    18.92%
NORB          9.32%                7.96%   7.62%     7.67%   7.46%     7.23%     8.31%   11.60%   18.40%
(MF-0 uses no top-down processing; MF-1 through MF-Full are DBM inference procedures.)

Inference speed on NORB (per 100 examples): MFRec-1: 0.69 sec, MF-5: 3.20 sec, MF-Full: 16.00 sec.
Code is available online!

Conclusion
- Boltzmann machines provide a flexible framework for modeling data with latent variables.
- Using tools from graphical models, it is possible to develop many extensions for the RBM.
- Other variations: Conditional RBM (sequential data), higher-order RBM, and more!

Merci! (Thank you!)