Au-delà de la Machine de Boltzmann Restreinte Hugo Larochelle University of Toronto
Introduction Restricted Boltzmann Machines (RBMs) are useful feature extractors They are mostly used to initialize deep feed-forward neural networks Can the Boltzmann machine modeling framework be useful on its own?
Restricted Boltzmann Machine b j h hidden units (binary) bias W connections c k x input units (binary) Energy function : E(x, h) = h Wx c x b h = W jk h j x k c k x k jk k j b j h j Distribution: p(x, h) = exp( E(x, h))/z intractable in general
Inference x h p(h x) = j p(h j = 1 x) = p(h j x) 1 1 + exp( (b j + W j x)) = sigm(b j + W j x) h p(x h) = k p(x k h) x p(x k = 1 h) = 1 1 + exp( (c k + h W k )) = sigm(c k + h W k )
Inference + ++++ + h p(x) = h {0,1} H p(x, h) = h {0,1} H exp( E(x, h))/z x exp( E(x, h))/z = h {0,1} H h 1 {0,1} free energy = exp c x + = exp( F (x))/z F exp( E(x, h))/z h H {0,1} H j=1 log(1 + exp(b j + W j x)) /Z F
Inference: in general h marg { x unk : unknown variables, whose distribution we are interested in + + + + + + h x cond : known variables, on which we condition { x cond { x unk p(x unk x cond ) = x h marg : unknown variables, over which we marginalize exp(e(x h marg unk, x cond, h marg )) x exp(e(x unk h marg unk, x cond, h marg ))
Learning To train an RBM, we minimize the negative log-likelihood (NLL) of the data L gen (D train ) = 1 n x t D train log p(x t ) We proceed by stochastic gradient descent log p(x t ) θ = E h [ E(xt, h) θ ] x t [ ] E(x, h) E x,h θ { positive phase { negative phase
Contrastive Divergence (CD) (Hinton, Neural Computation, 2002) Idea: 1. replace expectation by an evaluation at a single point x, h 2. obtain x, h by MCMC sampling h t = h 0... 3. start sampling chain at x t h k = h p(h x) p(x h) x t x 1 x k = x
Contrastive Divergence (CD) (Hinton, Neural Computation, 2002) E h [ E(xt, h) θ ] x t E(x t, h t ) θ E x,h [ E(x, h) θ ] E( x, h) θ E(x, h) (x t, h t ) ( x, h)
Contrastive Divergence (CD) (Hinton, Neural Computation, 2002) CD-k: contrastive divergence with k Gibbs sampling steps In general, the bigger k is, the less biased is the estimate of the gradient In practice, k=1 works well For an RBM, the parameter updates won t actually need samples for h
Other learning algorithm: Persistent CD (PCD) (Tieleman, ICML2008) Idea: instead of always reinitializing the Gibbs chain to, use from the previous iteration instead x t x h t = h 0... h k = h p(h x) p(x h) x t x 1 x k = x
Other learning algorithm: Persistent CD (PCD) (Tieleman, ICML2008) Idea: instead of always reinitializing the Gibbs chain to, use from the previous iteration instead x t x h t = h 0... h k = h p(h x) p(x h) x 1 x k = x
Other learning algorithm: Persistent CD (PCD) (Tieleman, ICML2008) Idea: instead of always reinitializing the Gibbs chain to, use from the previous iteration instead x t x h t = h 0... h k = h p(h x) p(x h) x comes from previous iteration x 1 x k = x
Plan RBMs can initialize deep neural networks, but not a Boltzmann machine anymore Three extensions to RBMs: Classification RBM Restricted Boltzmann Forest Deep Boltzmann Machine (DBM)
Classification Restricted Boltzmann Machine b j h hidden units (binary) W connections c k x input units (binary) Energy function : E(x, h) = h Wx c x b h = W jk h j x k c k x k jk k j b j h j Distribution: p(x, h) = exp( E(x, h))/z
Classification Restricted Boltzmann Machine b j h hidden units (binary) W connections c k x input units (binary) Energy function : = h Wx c x b h = W jk h j x k c k x k jk k j b j h j Distribution:
Classification Restricted Boltzmann Machine b j h hidden units (binary) U W connections target units (multinomial) y 0 0 1 0 c k x input units (binary) Energy function : E(x, h, y) = h Wx c x b h h Uy a y = W jk h j x k c k x k b j h j jk k j jl U jl h j y l l a l y l Distribution: p(x, h, y) = exp( E(x, h, y))/z
Inference: in general h marg { x unk : unknown variables, whose distribution we are interested in + + + + + + h x cond : known variables, on which we condition { x cond { x unk p(x unk x cond ) = x h marg : unknown variables, over which we marginalize exp(e(x h marg unk, x cond, h marg )) x exp(e(x unk h marg unk, x cond, h marg ))
Discriminative training (DRBM) (Larochelle and Bengio, ICML2008) What if we ditional are only likelihood interested of the in classification: targets: L disc (D train ) = D train i=1 log p(y i x i ) Since p(y x) can be computed exactly, we can also compute the discriminative gradient exactly: log p(y i x i ) θ where F (y, x) = d y = [ ] θ F (y i, x i ) E y xi θ F (y, x i) ( ( Hn log 1 + exp c j + U jy + j=1 )) d W ji x i i=1
Hybrid generative/discriminative training (HDRBM) (Larochelle and Bengio, ICML2008) We can explore the space between generative and discriminative training with (Bouchard & Triggs, 2004) α L hybrid (D train ) = L disc (D train ) + αl gen (D train ) Perform two updates, one according to each criterion Generative criterion acts as a data-dependent regulariser
Experiment: MNIST Model Error RBM (λ = 0.005, n = 6000) 3.39% DRBM (λ = 0.05, n = 500) 1.81% RBM+NNet 1.41% HDRBM (α = 0.01, λ = 0.05, n = 1500 ) 1.28% Sparse HDRBM (idem + n = 3000, δ = 10 4 ) 1.16% SVM 1.40% NNet 1.93% Subset of filters learned by the HDRBM on the MNIST dataset Some filters act as edge detectors (first row), some other filters are more specific to a particular digit shape (second row)
Experiment: 20newsgroup Model Error RBM (λ = 0.0005, n = 1000) 24.9% DRBM (λ = 0.0005, n = 50) 27.6% RBM+NNet 26.8% HDRBM (α = 0.005, λ = 0.1, n = 1000 ) 23.8% SVM 32.8% NNet 28.2% Newsgroup Most influencial words comp.sys.ibm.pc.hardware dos, ide, adaptec, pc, config, irq, vlb, bios, scsi, esdi, comp.sys.mac.hardware apple, mac, quadra, powerbook, lc, pds, centris alt.atheism bible, atheists, benedikt, atheism, religion talk.politics.guns firearms, handgun, firearm, gun, rkba, concealed rec.sport.hockey playoffs, penguins, didn, playoff, game, out, play
Restricted Boltzmann Forest (Larochelle et al., Neural Computation, to appear) How to obtain a tractable density estimator from an RBM? Answer: constrain RBM such that we can compute Z! Z = h {0,1} H x {0,1} d exp( E(x, h)) Z = h {0,1} H exp( F (h))number of hidden units must be small
Restricted Boltzmann Forests (Larochelle et al., Neural Computation, to appear) Alternative: constrain activations using a tree h W x
Restricted Boltzmann Forests (Larochelle et al., Neural Computation, to appear) Conditioned on x, the distribution of each tree is independant Using belief propagation (sum-product) can: sample hidden units compute unit expectations (probabilities) sum out all hidden units
Experiments Dataset Nb. Set sizes Diff. NLL Diff. NLL Diff. NL inputs (Train,Valid,Test) RBM RBM RBForest mult. RBFore adult 123 5000,1414,26147 4.18 ± 0.11 4.15 4.12 ± 0.11 4.12 ± 0. connect-4 126 16000,4000,47557 0.75 ± 0.04-1.72 0.59 ± 0.04 0.05 0.59 ± 0. dna 180 1400,600,1186 1.29 ± 0.72 1.45 1.39 ± 0.73 0.67 1.39 ± 0. mushrooms 112 2000,500,5624-0.69 ± 0.13 0.042-0.69 ± 0.12 0.11 0.042 ± 0. nips-0-12 500 400,100,1240 12.65 ± 1.55 12.61 11.25 ± 1.55 12.61 ± 1. ocr-letter 128 32152,10000,10000-2.49 ± 0.44 0.99 3.78 ± 0.43 3.78 ± 0. rcv1 150 40000,10000,150000-1.29 ± 0.16-0.044 0.56 ± 0.16 0.56 ± 0. web 300 14000,3188,32561 0.78 ± 0.30 0.018-0.15 ± 0.31-0.15 ± 0.
Deep Boltzmann Machines (DBM) (Salakhutdinov and Hinton, AISTATS2009) Deep Belief Network Deep Boltzmann Machine h 3 W 3 h 2 W 2 h 1 W 1 v E(v, h; θ) = v W 1 h 1 h 1 W 2 h 2 h 2 W 3 h 3
Learning Learning under a DBM requires a similar gradient log P (v t ) θ = E h [ E(vt, h; θ) θ ] v t E v,h [ E(v, h; θ) θ ] { positive phase { negative phase We use Persistent CD for the negative phase The positive phase now must be approximated
Learning In positive phase, a variational approach is used Q MF (h v; µ) = F 1 F 2 F 3 j=1 k=1 m=1 Procedure for minimizing { KL(P (h v), Q MF (h v; µ)) q(h 1 j)q(h 2 k)q(h 3 m) where ( D µ 1 j σ i=1 ( F1 µ 2 k σ j=1 ( F2 µ 3 m σ k=1 q(h l i = 1) = µl i for l = 1, 2, 3. nd on the log-probability of the d W 1 ijv i + W 2 jkµ 1 j + W 3 kmµ 2 k F 2 k=1 ). F 3 m=1 W 2 jkµ 2 k ), W 3 kmµ 3 m ),
Pretraining h 2 h 2 Pretraining h 3 h 3 W 3 W 3 RBM RBM Deep Boltzmann Machine h 3 W 3 h 2 2W 2 h 1 h 1 W 1 W 1 RBM h 1 W 2 W 1 v v v
Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS2010) Unfortunately, the learning procedure described so far is quite slow The major bottleneck is variational inference, which requires around 25 passes up and down the network before converging
Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS2010) Deep Boltzmann Machine Idea: suppose that we have relatively good mean-field recognition net Could initialize meanfield with recognition net output R 3 2R 2 2R 1 W 3 W 2 W 1
Efficient Learning of Deep Boltzmann Machines (Salakhutdinov and Larochelle, AISTATS2010) Pseudocode positive phase negative phase { { ν 1. get prediction from recognition net 2. initialize mean-field to µ ν and do only a few steps of mean-field 3. do Persistent CD 4. update DBM parameters 5. update recognition net parameters 6. go back to 1 until converged
Experiments Classification Dataset no top-down process DBM inference procedures MF-0 MF-1 MFRec-1 MF-5 MFRec-5 MF-Full DBN SVM K-NN MNIST 1.38% 1.15% 1.00% 1.01% 0.96% 0.95% 1.17% 1.40% 3.09% OCR letters 8.68% 8.44% 8.40% 8.50% 8.48% 8.58% 9.68% 9.70% 18.92% NORB 9.32% 7.96% 7.62% 7.67% 7.46% 7.23% 8.31% 11.60% 18.40% Inference speed on NORB (per 100 examples) MF-Rec1: 0.69 sec MF-5: 3.20 sec MF-full: 16.00 sec Code is available online!
Conclusion Boltzmann machines provide a flexible framework for modeling data with latent variables Using tools from graphical models, it is possible to develop many extensions for the RBM Other variations: Conditional RBM (sequential data) Higher order RBM and more!
Merci!