CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing
(Slides based on Lectures 16-18 and selected readings)
Yue Li
Email: yueli@cs.toronto.edu
Wed 11-12 March 19
Fri 10-11 March 21
Outline
- Boltzmann Machines
- Simulated Annealing
- Restricted Boltzmann Machines
- Deep learning using stacked RBMs
General Boltzmann Machines [1]

[Figure: a small network with visible units v1, v2 and hidden units h1, h2]

- The network is symmetrically connected
- Connections are allowed between visible and hidden units
- Each binary unit makes a stochastic decision to be either on or off
- The configuration of the network dictates its energy
- At the equilibrium state of the network, the likelihood is defined as the exponentiated negative energy, known as the Boltzmann distribution
Boltzmann Distribution

$$E(v, h) = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij} \qquad (1)$$

$$P(v) = \frac{\sum_h \exp(-E(v, h))}{\sum_{v,h} \exp(-E(v, h))} \qquad (2)$$

where $v$ and $h$ are visible and hidden units, the $w_{ij}$ are connection weights between visible-visible, hidden-hidden, and visible-hidden units, and $E(v, h)$ is the energy function.

Two problems:
1. Given $w_{ij}$, how to achieve thermal equilibrium of $P(v, h)$ over all possible network configurations, including visible & hidden units
2. Given $v$, learn $w_{ij}$ to maximize $P(v)$
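To make equations (1)-(2) concrete, here is a minimal sketch that computes the Boltzmann distribution by brute-force enumeration. The toy 4-unit network, its random weights, and all variable names are assumptions for illustration only:

```python
# A toy check of equations (1)-(2): brute-force Boltzmann distribution for a
# 4-unit network (2 visible + 2 hidden). Weights, biases, and sizes are
# made up for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # units s_1..s_4; first 2 are visible
W = np.triu(rng.normal(0, 0.5, (n, n)), 1)   # w_ij for i < j only
b = rng.normal(0, 0.5, n)

def energy(s):
    # E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij   (equation 1)
    return -(s @ b) - (s @ W @ s)

states = np.array(list(itertools.product([0, 1], repeat=n)))
p = np.exp([-energy(s) for s in states])
p /= p.sum()                            # normalize over all 2^n configurations

# P(v): marginalize the joint over the hidden units (equation 2)
for v in itertools.product([0, 1], repeat=2):
    mask = (states[:, :2] == v).all(axis=1)
    print(v, p[mask].sum())
```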
Thermal equilibrium is a difficult concept (Lec 16):
- It does not mean that the system has settled down into the lowest energy configuration.
- The thing that settles down is the probability distribution over configurations.

[Figure: unnormalized transition probabilities (illustration only) between states A, B, C at a high temperature T and at a low temperature T, and the resulting thermal equilibrium.]
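The "distribution settles, not the state" idea can be seen with a tiny Markov chain. A minimal sketch, assuming a made-up 3-state transition matrix (not the unnormalized numbers from the slide): whatever distribution we start from, repeated transitions converge to the same stationary distribution.

```python
# The distribution settles, not the state: iterate a made-up 3-state
# transition matrix; any starting distribution converges to the same
# stationary distribution over (A, B, C).
import numpy as np

T = np.array([[0.8, 0.1, 0.1],    # row = current state, column = next state
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

p = np.array([1.0, 0.0, 0.0])     # all probability mass on A initially
for _ in range(50):
    p = p @ T                     # one transition of the Markov chain
print(p)                          # approximate equilibrium distribution
```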
Simulated annealing [2]

Scale the Boltzmann factor by a temperature $T$:

$$P(s) = \frac{\exp(-E(s)/T)}{\sum_{s'} \exp(-E(s')/T)} \qquad (3)$$

where $s = \{v, h\}$. At step $t+1$, a proposed state $s'$ is compared with the current state $s_t$:

$$\frac{P(s')}{P(s_t)} = \exp\left(-\frac{E(s') - E(s_t)}{T}\right) = \exp\left(-\frac{\Delta E}{T}\right) \qquad (4)$$

$$s_{t+1} = \begin{cases} s' & \text{if } \Delta E < 0 \text{ or } \exp(-\Delta E/T) > \text{rand}(0,1) \\ s_t & \text{otherwise} \end{cases} \qquad (5)$$

NB: $T$ controls the stochasticity of the transition: when $\Delta E > 0$, raising $T$ raises $\exp(-\Delta E/T)$ (more uphill moves accepted), and lowering $T$ lowers $\exp(-\Delta E/T)$.
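A minimal sketch of the accept/reject rule in equations (4)-(5), assuming a toy 1-D energy function, Gaussian proposals, and a geometric cooling schedule (all of which are assumptions, not from the lecture):

```python
# Simulated annealing on a toy 1-D energy, following equations (4)-(5):
# accept downhill moves, accept uphill moves with probability exp(-dE/T),
# and lower T over time.
import numpy as np

rng = np.random.default_rng(0)

def E(x):
    return x**2 + 10 * np.sin(3 * x)      # multimodal toy energy function

x, T = 4.0, 2.0                           # initial state s_0 and temperature
for _ in range(2000):
    x_new = x + rng.normal(0, 0.5)        # proposed state s'
    dE = E(x_new) - E(x)                  # Delta E from equation (4)
    if dE < 0 or np.exp(-dE / T) > rng.random():   # acceptance rule (5)
        x = x_new
    T *= 0.997                            # geometric cooling schedule
print(x, E(x))                            # should land near a deep minimum
```

Early on, the high temperature lets the chain hop between basins; as $T$ decreases, uphill moves become rare and the state freezes into a low-energy configuration.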
A nice demo of simulated annealing from Wikipedia: http://www.cs.utoronto.ca/~yueli/csc321_utm_2014_files/hill_climbing_with_simulated_annealing.gif

Note: simulated annealing is not used in the Restricted Boltzmann Machine algorithm discussed below; Gibbs sampling is used instead. Nonetheless, it is still a nice concept and has been used in many other applications (the paper by Kirkpatrick et al. (1983) [2] has been cited over 30,000 times according to Google Scholar!)
Learning weights from Boltzmann Machines is difficult

$$P(v) = \prod_{n=1}^{N} \frac{\sum_h \exp(-E(v^{(n)}, h))}{\sum_{v,h} \exp(-E(v, h))} = \prod_{n=1}^{N} \frac{\sum_h \exp\big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\big)}{\sum_{v,h} \exp\big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\big)}$$

$$\log P(v) = \sum_{n=1}^{N} \left[ \log \sum_h \exp\Big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\Big) - \log \sum_{v,h} \exp\Big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\Big) \right]$$

$$\frac{\partial \log P(v)}{\partial w_{ij}} = \sum_{n=1}^{N} \left[ \sum_{s_i, s_j} s_i s_j P(h \mid v) - \sum_{s_i, s_j} s_i s_j P(v, h) \right] = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

where $\langle x \rangle$ is the expected value of $x$, and $s_i, s_j \in \{v, h\}$. $\langle s_i s_j \rangle_{\text{model}}$ is difficult or takes a long time to compute.
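Continuing the brute-force sketch above (reusing `states` and `p` from the toy 4-unit network), this computes $\langle s_i s_j \rangle_{\text{model}}$ exactly; the point is that the sum runs over every joint configuration, which is what makes the model expectation intractable for realistic network sizes:

```python
# Exact <s_i s_j>_model for the toy 4-unit network above (reusing `states`
# and `p`): the sum runs over all 2^n configurations, so the cost grows
# exponentially with network size.
i, j = 0, 2
model_corr = sum(pk * s[i] * s[j] for pk, s in zip(p, states))
print(model_corr)
```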
Restricted Boltzmann Machine (RBM) [3]
- A simple unsupervised learning module;
- Only one layer of hidden units and one layer of visible units;
- No connections between hidden units nor between visible units (i.e., a special case of Boltzmann Machine);
- Edges are still undirected or bi-directional.

e.g., an RBM with 2 visible and 3 hidden units:

[Figure: hidden layer h1, h2, h3 fully connected to input layer v1, v2]
Objective function of RBM - maximum likelihood:

$$E(v, h \mid \theta) = -\sum_{ij} w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j b_j h_j$$

$$p(v \mid \theta) = \prod_{n=1}^{N} \sum_h p(v^{(n)}, h \mid \theta) = \prod_{n=1}^{N} \frac{\sum_h \exp(-E(v^{(n)}, h \mid \theta))}{\sum_{v,h} \exp(-E(v, h \mid \theta))}$$

$$\log p(v \mid \theta) = \sum_{n=1}^{N} \left[ \log \sum_h \exp(-E(v^{(n)}, h \mid \theta)) - \log \sum_{v,h} \exp(-E(v, h \mid \theta)) \right]$$

$$\frac{\partial \log p(v \mid \theta)}{\partial w_{ij}} = \sum_{n=1}^{N} \left[ \sum_h v_i h_j \, p(h \mid v) - \sum_{v,h} v_i h_j \, p(v, h) \right] = E_{\text{data}}[v_i h_j] - E_{\text{model}}[\hat{v}_i \hat{h}_j] \equiv \langle v_i h_j \rangle_{\text{data}} - \langle \hat{v}_i \hat{h}_j \rangle_{\text{model}}$$

But $\langle \hat{v}_i \hat{h}_j \rangle_{\text{model}}$ is still too expensive to estimate, so we apply Markov Chain Monte Carlo (MCMC) (i.e., Gibbs sampling) to estimate it.
[Figure: a Gibbs chain from the data $\langle v_i h_j \rangle^0$ at $t=0$ through $t=1, 2, \ldots$ to the equilibrium "fantasy" $\langle v_i h_j \rangle^\infty$ at $t=\infty$; the shortcut stops at the reconstruction $\langle v_i h_j \rangle^1$ at $t=1$.]

$$\frac{\partial \log p(v^0)}{\partial w_{ij}} = \langle h^0 (v^0 - v^1) \rangle + \langle v^1 (h^0 - h^1) \rangle + \langle h^1 (v^1 - v^2) \rangle + \ldots = \langle v_i^0 h_j^0 \rangle - \langle v_i^\infty h_j^\infty \rangle \approx \langle v_i^0 h_j^0 \rangle - \langle v_i^1 h_j^1 \rangle$$
How Gibbs sampling works

[Figure: data $\langle v_i h_j \rangle^0$ at $t=0$ and reconstruction $\langle v_i h_j \rangle^1$ at $t=1$.]

1. Start with a training vector on the visible units
2. Update all the hidden units in parallel
3. Update all the visible units in parallel to get a reconstruction
4. Update the hidden units again

$$\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1) \qquad (6)$$
Approximate maximum likelihood learning

$$\frac{\partial \log p(v)}{\partial w_{ij}} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} h_j^{(n)} - \hat{v}_i^{(n)} \hat{h}_j^{(n)} \right] \qquad (7)$$

where $v_i^{(n)}$ is the value of the $i$th visible (input) unit for the $n$th training case; $h_j^{(n)}$ is the value of the $j$th hidden unit; $\hat{v}_i^{(n)}$ is the sampled value for the $i$th visible unit, or the negative data, generated based on $h_j^{(n)}$ and $w_{ij}$; $\hat{h}_j^{(n)}$ is the sampled value for the $j$th hidden unit, or the negative hidden activities, generated based on $\hat{v}_i^{(n)}$ and $w_{ij}$.

Still, how exactly are the negative data and negative hidden activities generated?
wake-sleep algorithm (Lec 18 p5)

1. Positive ("wake") phase (clamp the visible units with data):
   Use input data to generate hidden activities:
   $$h_j = \frac{1}{1 + \exp(-\sum_i v_i w_{ij} - b_j)}$$
   Sample the hidden state from a Bernoulli distribution:
   $$h_j \leftarrow \begin{cases} 1, & \text{if } h_j > \text{rand}(0,1) \\ 0, & \text{otherwise} \end{cases}$$

2. Negative ("sleep") phase (unclamp the visible units from data):
   Use $h_j$ to generate negative data:
   $$\hat{v}_i = \frac{1}{1 + \exp(-\sum_j w_{ij} h_j - b_i)}$$
   Use the negative data $\hat{v}_i$ to generate negative hidden activities:
   $$\hat{h}_j = \frac{1}{1 + \exp(-\sum_i \hat{v}_i w_{ij} - b_j)}$$
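A minimal sketch of the two phases for a single training vector. The small RBM below (its sizes, random weights, and variable names) is a made-up example; `W[i, j]` connects visible unit $i$ to hidden unit $j$:

```python
# One positive ("wake") and one negative ("sleep") phase for a single
# training vector, following the four formulas above.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 3
W = rng.normal(0, 0.1, (n_vis, n_hid))
b_v = np.zeros(n_vis)                            # visible biases b_i
b_h = np.zeros(n_hid)                            # hidden biases b_j
v = rng.integers(0, 2, n_vis).astype(float)      # a training vector

# Positive phase: h_j = sigma(sum_i v_i w_ij + b_j), then Bernoulli sample
h_prob = sigmoid(v @ W + b_h)
h = (h_prob > rng.random(n_hid)).astype(float)

# Negative phase: reconstruct v_hat, then negative hidden activities h_hat
v_hat = sigmoid(W @ h + b_v)     # v_hat_i = sigma(sum_j w_ij h_j + b_i)
h_hat = sigmoid(v_hat @ W + b_h)
```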
RBM learning algorithm (cont'd) - Learning

$$\Delta w_{ij}^{(t)} = \eta \, \Delta w_{ij}^{(t-1)} + \epsilon_w \left( \frac{\partial \log p(v \mid \theta)}{\partial w_{ij}} - \lambda w_{ij}^{(t-1)} \right)$$

$$\Delta b_i^{(t)} = \eta \, \Delta b_i^{(t-1)} + \epsilon_{vb} \frac{\partial \log p(v \mid \theta)}{\partial b_i}$$

$$\Delta b_j^{(t)} = \eta \, \Delta b_j^{(t-1)} + \epsilon_{hb} \frac{\partial \log p(v \mid \theta)}{\partial b_j}$$

where

$$\frac{\partial \log p(v \mid \theta)}{\partial w_{ij}} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} h_j^{(n)} - \hat{v}_i^{(n)} \hat{h}_j^{(n)} \right]$$

$$\frac{\partial \log p(v \mid \theta)}{\partial b_i} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} - \hat{v}_i^{(n)} \right]$$

$$\frac{\partial \log p(v \mid \theta)}{\partial b_j} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ h_j^{(n)} - \hat{h}_j^{(n)} \right]$$
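A minimal sketch of one CD-1 mini-batch update implementing the rules above, continuing the names from the previous sketch; the values of $\eta$, the $\epsilon$'s, and $\lambda$ are assumed hyperparameters, not from the lecture:

```python
# One CD-1 mini-batch update with momentum (eta) and weight decay (lam),
# continuing W, b_v, b_h, sigmoid, and rng from the previous sketch.
N = 10
V = rng.integers(0, 2, (N, n_vis)).astype(float)     # mini-batch of data

dW, db_v, db_h = np.zeros_like(W), np.zeros_like(b_v), np.zeros_like(b_h)
eta, eps_w, eps_vb, eps_hb, lam = 0.9, 0.1, 0.1, 0.1, 1e-4

H = sigmoid(V @ W + b_h)                        # positive hidden probabilities
Hs = (H > rng.random(H.shape)).astype(float)    # sampled hidden states
V_hat = sigmoid(Hs @ W.T + b_v)                 # negative data
H_hat = sigmoid(V_hat @ W + b_h)                # negative hidden activities

grad_W = (V.T @ H - V_hat.T @ H_hat) / N        # <v h>_data - <v^ h^>_recon
dW = eta * dW + eps_w * (grad_W - lam * W)      # momentum + weight decay
db_v = eta * db_v + eps_vb * (V - V_hat).mean(axis=0)
db_h = eta * db_h + eps_hb * (H - H_hat).mean(axis=0)
W += dW; b_v += db_v; b_h += db_h
```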
Deep learning using stacked RBM on images [3]

[Figure: architecture with a 28 x 28 pixel image layer, two layers of 500 units, 2000 top-level units, and 10 label units; "This could be the top level of another sensory pathway".]

A greedy learning algorithm:
- The bottom layer encodes the 28 x 28 handwritten image
- The adjacent layer of 500 hidden units is used for a distributed representation of the images
- The next 500-unit layer and the top layer of 2000 units, called the associative memory layers, have undirected connections between them
- The very top layer encodes the class labels with a softmax
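A minimal sketch of the greedy layer-wise idea: train one RBM with CD-1, then feed its hidden probabilities upward as the "data" for the next RBM. It reuses `sigmoid` and `rng` from the sketches above; the layer sizes follow the slide, but the `train_rbm` helper, stand-in random data, and hyperparameters are all assumptions, and the 10 label units and associative-memory fine-tuning of Hinton et al. (2006) are omitted:

```python
# Greedy layer-wise pretraining: each RBM is trained with plain CD-1, then
# its hidden probabilities become the training data for the next layer up.
def train_rbm(X, n_hid, eps=0.1, epochs=10):
    n_vis = X.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)
        Hs = (H > rng.random(H.shape)).astype(float)
        X_hat = sigmoid(Hs @ W.T + b_v)
        H_hat = sigmoid(X_hat @ W + b_h)
        W += eps * (X.T @ H - X_hat.T @ H_hat) / len(X)
        b_v += eps * (X - X_hat).mean(axis=0)
        b_h += eps * (H - H_hat).mean(axis=0)
    return W, b_v, b_h

X = rng.integers(0, 2, (100, 784)).astype(float)   # stand-in for 28x28 images
for n_hid in [500, 500, 2000]:                     # layer sizes from the slide
    W_l, b_v_l, b_h_l = train_rbm(X, n_hid)
    X = sigmoid(X @ W_l + b_h_l)                   # becomes next layer's data
```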
The network trained on 60,000 training cases achieved a 1.25% test error on classifying 10,000 MNIST test cases. [Figure: the incorrectly classified images, with the predicted labels in the top left corner of each (Figure 6, Hinton et al., 2006).]
Let the model generate 28 x 28 images for a specific class label

Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is run for 1000 iterations of alternating Gibbs sampling (Figure 8, Hinton et al., 2006).
Look into the mind of the network

Each row shows 10 samples from the generative model with a particular label clamped on... Subsequent columns are produced by 20 iterations of alternating Gibbs sampling in the associative memory (Figure 9, Hinton et al., 2006).
Deep learning using stacked RBM on handwritten images (Hinton et al., 2006)

A real-time demo from Prof. Hinton's webpage: http://www.cs.toronto.edu/~hinton/digits.html
Further References

David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

G. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 2006.