Gaussian-Bernoulli Deep Boltzmann Machine

KyungHyun Cho, Tapani Raiko and Alexander Ilin
Department of Information and Computer Science, Aalto University School of Science
Email: {kyunghyun.cho,tapani.raiko,alexander.ilin}@aalto.fi

Abstract—In this paper, we study a model that we call the Gaussian-Bernoulli deep Boltzmann machine (GDBM) and discuss potential improvements in training the model. GDBM is designed to be applicable to continuous data and it is constructed from the Gaussian-Bernoulli restricted Boltzmann machine (GRBM) by adding multiple layers of binary hidden neurons. The studied improvements of the learning algorithm for GDBM include parallel tempering, the enhanced gradient, an adaptive learning rate and layer-wise pretraining. We empirically show that they help avoid some of the common difficulties found in training deep Boltzmann machines, such as divergence of learning, the difficulty of choosing a suitable learning rate schedule, and the existence of meaningless higher layers.

I. INTRODUCTION

The deep Boltzmann machine (DBM) [1] is a recent extension of the simple restricted Boltzmann machine (RBM) in which several RBMs are stacked on top of each other. Unlike in other models, such as the deep belief network or deep autoencoders (see, e.g., [2], [3]), in a DBM each neuron in the intermediate layers receives both top-down and bottom-up signals, which facilitates the propagation of uncertainty during the inference procedure [1].

The original DBM is constructed such that each visible neuron represents a binary variable, that is, the DBM learns distributions over binary vectors. A popular approach to modeling real-valued data is normalizing each input variable to [0,1] and treating it as a probability (e.g., using the gray-scale value of a pixel as a probability [2], [4]). This approach is, however, restrictive, and it fits bounded variables best. In the original DBM [1], real-valued data are first transformed into binary codes by training a model called the Gaussian-Bernoulli RBM (GRBM) [2], and the DBM is learned for the binary codes extracted from the data. This approach showed promising results [1], [5], [6], but it may be beneficial to combine GRBM and DBM in a single model and allow their joint optimization.

In this paper, we study a Gaussian-Bernoulli deep Boltzmann machine (GDBM) which uses Gaussian units in the visible layer of the DBM. Even though deriving the stochastic gradient is rather easy for GDBM, the training procedure can easily run into problems without careful selection of the learning parameters. This is largely caused by the fact that GRBM is known to be difficult to tune, especially the variance parameters of the visible neurons (see, e.g., [7]). We propose an algorithm for training GDBM based on the improvements introduced for training GRBM in [8]. Also, we discuss the universal approximator property of GDBM.

The rest of the paper is organized as follows. The GDBM model is introduced in Section II. In Section III, we present the training algorithm and explain how to compute the terms of the stochastic gradient using mean-field approximations and parallel tempering sampling. In Section IV-A, we describe update rules that are invariant to the data representation and more robust to the initialization of the parameters. In Section IV-B, we describe how we adapt the learning rate using the ideas proposed in [9]. We first presented most of this work as an introduction to GDBMs in the NIPS Workshop on Deep Learning and Unsupervised Feature Learning [10]. However, that workshop has no proceedings, and therefore the paper presented there is not a proper publication. Since then, GDBM has been used in applications such as multimodal learning [11] and image denoising [12].
II. GAUSSIAN-BERNOULLI DEEP BOLTZMANN MACHINE

A GDBM with a single visible layer and L hidden layers is parameterized with weights W of the synaptic connections between the visible layer and the first hidden layer, weights W^{(l)} between layers l and l+1, biases b of the visible layer, biases b^{(l)} of each hidden layer l, and standard deviations σ_i of the visible neurons. For a state [v^T h^{(1)T} ... h^{(L)T}]^T, the energy is defined as

E(v, h^{(1)}, ..., h^{(L)} | θ) = \sum_{i=1}^{n_v} \frac{(v_i - b_i)^2}{2σ_i^2} - \sum_{i=1}^{n_v} \sum_{j=1}^{n_1} \frac{v_i}{σ_i^2} w_{ij} h_j^{(1)} - \sum_{l=1}^{L} \sum_{j=1}^{n_l} b_j^{(l)} h_j^{(l)} - \sum_{l=1}^{L-1} \sum_{j=1}^{n_l} \sum_{k=1}^{n_{l+1}} h_j^{(l)} w_{jk}^{(l)} h_k^{(l+1)},   (1)

and the corresponding probability is

p(v, {h^{(l)}}_{l=1}^{L} | θ) = \frac{1}{Z(θ)} \exp\left( -E(v, {h^{(l)}}_{l=1}^{L} | θ) \right),

where n_v and n_l are the numbers of neurons in the visible layer and the l-th hidden layer, respectively, and Z(θ) is the normalizing constant that depends on the parameters θ of the GDBM. Note that we use the GRBM parameterization from [8], including learning z_i = log σ_i^2 instead of σ_i directly. Note also that GRBM is a special case of GDBM with L = 1.

The states of the neurons in the same layer are independent of each other given the adjacent upper and lower layers.
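To make the parameterization concrete, the following minimal numpy sketch evaluates the energy of Eq. (1) for one state; the function name, the argument layout (a list of hidden-layer weight matrices and biases) and the use of numpy are our own assumptions for illustration, not anything prescribed by the model.

```python
import numpy as np

def gdbm_energy(v, hs, W0, Ws, b, bs, z):
    """Energy of one GDBM state as in Eq. (1), with binary hidden layers hs[0..L-1].

    v  : (n_v,) visible vector
    hs : list of hidden states, hs[l] has shape (n_{l+1},)
    W0 : (n_v, n_1) visible-to-first-hidden weights
    Ws : list of (n_{l+1}, n_{l+2}) weights between consecutive hidden layers
    b  : (n_v,) visible biases;  bs : list of hidden bias vectors
    z  : (n_v,) log-variances, sigma_i^2 = exp(z_i) (parameterization of [8])
    """
    sigma2 = np.exp(z)
    energy = np.sum((v - b) ** 2 / (2.0 * sigma2))        # quadratic visible term
    energy -= np.sum(((v / sigma2) @ W0) * hs[0])         # visible / first hidden layer
    for h, bias in zip(hs, bs):                           # hidden bias terms
        energy -= h @ bias
    for l, W in enumerate(Ws):                            # consecutive hidden layers
        energy -= hs[l] @ W @ hs[l + 1]
    return energy
```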

The conditional probability of a visible neuron is

p(v_i | h^{(1)}, θ) = N\left( v_i \,\Big|\, \sum_{j=1}^{n_1} h_j^{(1)} w_{ij} + b_i,\; σ_i^2 \right),

where N(· | μ, σ^2) is the probability density of the Normal distribution with mean μ and standard deviation σ, and the conditional probabilities of the hidden neurons are

P(h_j^{(1)} = 1 | v, h^{(2)}, θ) = f\left( \sum_{i=1}^{n_v} \frac{v_i}{σ_i^2} w_{ij} + \sum_{k=1}^{n_2} h_k^{(2)} w_{jk}^{(1)} + b_j^{(1)} \right),

P(h_j^{(l)} = 1 | h^{(l-1)}, h^{(l+1)}, θ) = f\left( \sum_{i=1}^{n_{l-1}} h_i^{(l-1)} w_{ij}^{(l-1)} + \sum_{k=1}^{n_{l+1}} h_k^{(l+1)} w_{jk}^{(l)} + b_j^{(l)} \right),

where f(·) is the sigmoid function and n_{L+1} = 0.

A. GDBM is a universal approximator

GRBM belongs to the family of mixtures of Gaussians (MoG) since its joint distribution can be factorized into p(v, h^{(1)}) = P(h^{(1)}) p(v | h^{(1)}), where P(h^{(1)}) gives the mixture coefficients and p(v | h^{(1)}) is Gaussian. MoGs are known to be universal approximators [13]. However, the center points of the exponentially many Gaussians in the data space are defined by only a linear number of parameters with respect to the number of hidden units, so not all MoGs can be written as a GRBM.

Given a MoG, we can transform it into a GRBM if we further constrain that exactly one of the hidden units is active at a time, that is, \sum_j h_j^{(1)} = 1. We set the columns of W to match the center points of the MoG and b to 0, and b^{(1)} is set such that the marginals P(h^{(1)}) match the mixing coefficients of the MoG. As the final step, we implement the restriction \sum_j h_j^{(1)} = 1 using another hidden layer. We set n_2 = n_1, w_{jk}^{(1)} = 3ω for all j = k and w_{jk}^{(1)} = -ω for all j ≠ k, b_k^{(2)} = -ω for each k, and further subtract ω from each b_j^{(1)}. As ω goes to infinity, it is clear that the probability of all states where h_k^{(2)} ≠ h_k^{(1)} or \sum_j h_j^{(1)} ≠ 1 goes to zero, and the remaining states implement the previous GRBM with the wanted restriction. We have thus shown that any MoG can be modeled with a GDBM, and GDBM is a universal approximator.

III. TRAINING GDBM

GDBM can be trained by stochastic maximization of the likelihood, where the likelihood function is computed by marginalizing out all the hidden neurons. For each parameter θ_i, the partial derivative of the log-likelihood function is

\frac{∂L}{∂θ_i} = \left\langle \frac{∂(-E(v^{(t)}, h | θ))}{∂θ_i} \right\rangle_d - \left\langle \frac{∂(-E(v, h | θ))}{∂θ_i} \right\rangle_m,   (2)

where ⟨·⟩_d and ⟨·⟩_m denote the expectation over the data distribution P(h | {v^{(t)}}, θ) and the model distribution P(v, h | θ), respectively, and {v^{(t)}} is the set of all training samples.

A. Computing expectation over the data distribution

Computing the first term of (2) is straightforward for restricted Boltzmann machines because in that model the hidden neurons are independent of each other given the visible neurons. However, this does not apply to GDBM and therefore one needs to use some sort of approximation. We employ the mean-field approximation that was used for training the binary DBM in [1]. In the mean-field approximation, the visible neurons are fixed to a training data sample, and the state of each hidden neuron h_j^{(l)} is described by its probability μ_j^{(l)} of being active, which is updated with the following fixed-point iterations:

μ_j^{(l)} ← f\left( \sum_{i=1}^{n_{l-1}} μ_i^{(l-1)} w_{ij}^{(l-1)} + \sum_{k=1}^{n_{l+1}} μ_k^{(l+1)} w_{jk}^{(l)} + b_j^{(l)} \right).

Note that n_0 = n_v, μ_i^{(0)} = v_i / σ_i^2, and the update rule for the top-most layer does not contain the summation over k = 1, ..., n_{l+1}. Using the mean-field approximation is known to introduce a bias. However, it is a computationally efficient scheme that captures only one mode of the posterior distribution, which can be desirable in many practical applications [1].
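As a sketch of these fixed-point iterations, the numpy fragment below clamps the visible layer to a data vector and repeatedly updates the mean-field activations of every hidden layer; the fixed iteration count, the 0.5 initialization and the variable names are our own assumptions rather than the paper's exact procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W0, Ws, bs, z, n_iters=25):
    """Mean-field posterior over the hidden layers of a GDBM with visibles clamped to v.

    mus[l] holds the probabilities mu^(l) of the l-th hidden layer being active;
    mu^(0) = v / sigma^2 plays the role of the bottom-up input (Section III-A),
    and the top layer receives no top-down term.
    """
    sigma2 = np.exp(z)
    mus = [np.full(b.shape, 0.5) for b in bs]       # start all activations at 0.5
    bottom = v / sigma2
    L = len(mus)
    for _ in range(n_iters):
        for l in range(L):
            below = bottom @ W0 if l == 0 else mus[l - 1] @ Ws[l - 1]
            above = mus[l + 1] @ Ws[l].T if l < L - 1 else 0.0   # no term for the top layer
            mus[l] = sigmoid(below + above + bs[l])
    return mus
```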
B. Computing expectation over the model distribution: Parallel Tempering approach

The second term of the gradient (2) can be computed using Markov-chain Monte-Carlo (MCMC) sampling. The original approach proposed in [4] uses persistent Gibbs sampling with only a few steps of sampling at each update. This is equivalent to persistent contrastive divergence (PCD) introduced for training RBM [15].

Unfortunately, persistent Gibbs sampling suffers from poor mixing of the samples, which results in trained models that may have probability mass in areas not represented in the training data (false modes) [16], [14], [1], [17]. In our experiments, we observed that learning can easily diverge when persistent Gibbs sampling is used (see Section V). We therefore use parallel tempering, recently proposed in the context of RBM and GRBM [16], [18], [8], as the sampling procedure for GDBM.

Parallel tempering overcomes the poor-mixing problem by maintaining multiple sampling chains with different temperatures. In a chain with a high temperature, particles are likely to explore the state space easily, whereas particles in the chains with low temperatures follow the target model distribution more closely.
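The swap move between neighboring chains is the standard replica-exchange step; the sketch below only assumes access to the unnormalized log-probability of a state under each tempered model (how those tempered models are constructed for a GDBM is specified next). The callable name, the chain ordering and the returned swap counts are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def swap_adjacent_chains(states, log_p_tempered, rng):
    """One round of replica-exchange swaps between adjacent tempered chains.

    states[k]            : current particle of chain k (chain 0 being the most diffuse)
    log_p_tempered(k, x) : unnormalized log-probability of x under the k-th tempered model
    Returns the number of accepted swaps for each adjacent pair; the paper's heuristic
    uses these counts to adapt the inverse temperatures.
    """
    n_swaps = np.zeros(len(states) - 1, dtype=int)
    for k in range(len(states) - 1):
        # standard Metropolis ratio for exchanging the particles of chains k and k+1
        log_ratio = (log_p_tempered(k, states[k + 1]) + log_p_tempered(k + 1, states[k])
                     - log_p_tempered(k, states[k]) - log_p_tempered(k + 1, states[k + 1]))
        if np.log(rng.uniform()) < log_ratio:
            states[k], states[k + 1] = states[k + 1], states[k]
            n_swaps[k] += 1
    return n_swaps
```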

We define the tempered distributions by varying the parameters θ of the original GDBM (1). We denote by θ_β the parameters of the intermediate models defined by inverse temperatures β, where β = 0 corresponds to the base (most diffuse) distribution and β = 1 corresponds to the target distribution defined by the original GDBM. Defining appropriate intermediate distributions is quite straightforward for binary RBM [16], [18], but it is not as trivial for models with real-valued visible units [8]. In this work, we use the tempering scheme defined by the following choice of θ_β:

b_{i,β} = β b_i + (1 - β) m_i,    b^{(l)}_β = β b^{(l)},    σ^2_{i,β} = β σ_i^2 + (1 - β) s_i^2,    W^{(l)}_β = β W^{(l)},

where m = [m_i]_{i=1}^{n_v} and s_i^2 are the means and variances estimated from the samples obtained from all the tempered distributions. Thus, the base distribution is the Gaussian distribution fitted to the samples from all the intermediate chains. The proposed scheme assures that swapping happens even if the target distribution diverges from the data distribution. According to our experiments, this results in more stable learning compared to the scheme proposed in [18].

Adapting the temperatures during learning can improve mixing and therefore facilitate learning [19]. In our experiments, we adapt the temperatures so as to keep the numbers of particle swaps between consecutive chains as equal as possible. This is done by adjusting the inverse temperatures {β_k, k = 1, ..., n_chains} after every swapping round using the following heuristic:

β_k^{(t)} ← η β_k^{(t-1)} + (1 - η) \frac{\sum_{j=1}^{k-1} n_j}{\sum_{j=1}^{n_chains - 1} n_j},

where η is a damping factor, n_j denotes the number of swaps between the chains defined by β_j and β_{j+1}, and the hottest chain is kept at the initial temperature, that is, β_1 = 0. This simple approach does not have much computational overhead and seems to improve learning, as we show in Section V. The proposed scheme of adapting the base distribution and the temperatures seems to work well in practice, although it may introduce a bias. The analysis of the possible biases of this approach is a question for further research.

IV. IMPROVING THE TRAINING PROCEDURE

In this section, we show how to improve the training of GDBM by adapting several ideas introduced for training RBMs and deep networks.

A. Enhanced gradient for GDBM

The enhanced gradient was introduced recently to make the update rules of binary Boltzmann machines invariant to the data representation [20], [9]. The gradient was derived by introducing bit-flipping transformations and making the update rule, which follows from (2), invariant to such transformations. It has been shown to improve learning of RBM by making the results less sensitive to the learning parameters and initialization.

The same ideas can be used for enhancing the gradient in models with Gaussian visible neurons, such as GRBM and GDBM. Instead of the bit-flipping transformations, one can transform the original model by shifting each visible unit as ṽ_i = v_i - λ_i and correcting the bias terms accordingly (b̃_i = b_i - λ_i and b̃_j^{(1)} = b_j^{(1)} + \sum_i λ_i w_{ij} / σ_i^2), which results in an equivalent model. Following the methodology of [9], one can select the shifting parameters such that the resulting gradients with respect to the weights and the biases do not contain the same terms. This yields the following update rules:

∇_e w_{ij} = Cov_d( v_i / σ_i^2, h_j^{(1)} ) - Cov_m( v_i / σ_i^2, h_j^{(1)} ),
∇_e w_{jk}^{(l)} = Cov_d( h_j^{(l)}, h_k^{(l+1)} ) - Cov_m( h_j^{(l)}, h_k^{(l+1)} ),
∇_e b_i = ∇b_i - \sum_j ⟨h_j^{(1)}⟩_dm ∇_e w_{ij},
∇_e b_j^{(1)} = ∇b_j^{(1)} - \sum_k ⟨h_k^{(2)}⟩_dm ∇_e w_{jk}^{(1)} - \sum_i ⟨v_i⟩_dm ∇_e w_{ij},
∇_e b_j^{(l)} = ∇b_j^{(l)} - \sum_k ⟨h_k^{(l+1)}⟩_dm ∇_e w_{jk}^{(l)} - \sum_i ⟨h_i^{(l-1)}⟩_dm ∇_e w_{ij}^{(l-1)}   (l > 1),

where the term containing ∇_e w^{(l)} is present only for l < L, ⟨h_j^{(l)}⟩_dm = ½⟨h_j^{(l)}⟩_d + ½⟨h_j^{(l)}⟩_m is the average activity of a neuron under the data and model distributions, ⟨v_i⟩_dm = ½⟨v_i/σ_i^2⟩_d + ½⟨v_i/σ_i^2⟩_m, and Cov_P(·,·) is the covariance of two variables under the distribution P, defined as Cov_P(v_i, h_j) = ⟨v_i h_j⟩_P - ⟨v_i⟩_P ⟨h_j⟩_P.
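The enhanced weight gradient above reduces to a difference of sample covariances, which is easy to compute from a data minibatch and from model samples. The sketch below is only an illustration for the bottom layer; the array shapes, function names and the use of mean-field activations for the data-side statistics are our own assumptions.

```python
import numpy as np

def enhanced_gradient_bottom(vd, h1d, vm, h1m, z):
    """Enhanced gradients for the visible-to-first-hidden weights and visible biases.

    vd, h1d : (N, n_v) data visibles and (N, n_1) mean-field first-layer activations
    vm, h1m : the same quantities for samples from the model distribution
    z       : (n_v,) log-variances, sigma^2 = exp(z)
    """
    def cov(x, h):                                   # sample covariance matrix
        return (x - x.mean(0)).T @ (h - h.mean(0)) / x.shape[0]

    sigma2 = np.exp(z)
    # grad_e w = Cov_d(v/sigma^2, h^(1)) - Cov_m(v/sigma^2, h^(1))
    grad_w = cov(vd / sigma2, h1d) - cov(vm / sigma2, h1m)
    # plain bias gradient and the <.>_dm average used by the enhanced rule
    grad_b = (vd / sigma2).mean(0) - (vm / sigma2).mean(0)
    h_dm = 0.5 * h1d.mean(0) + 0.5 * h1m.mean(0)
    grad_b_e = grad_b - grad_w @ h_dm                # grad_e b_i = grad b_i - sum_j <h_j>_dm grad_e w_ij
    return grad_w, grad_b_e
```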
B. Adaptive learning rate

The choice of the learning rate to be used with the stochastic gradient (2) greatly affects the training procedure [21], [22], [9]. In order to avoid the effect of choosing a learning rate and its scheduling, we adopt the strategy of automatic adaptation of the learning rate proposed in [9].

The adaptation is done based on an estimate of the likelihood computed using the identity

p(v_d | θ_η) = \frac{ p^*(v_d | θ_η) }{ Z(θ) \left\langle \frac{ p^*(v | θ_η) }{ p^*(v | θ) } \right\rangle_{p(v | θ)} },   (3)

where θ_η are the model parameters obtained by updating θ with learning rate η, p^* denotes an unnormalized pdf such that p(v | θ) = p^*(v | θ) / Z(θ), and the required expectation is approximated using samples from p(v | θ). The unnormalized probabilities p^*(v | θ) are obtained by integrating out the hidden neurons from the joint model, p^*(v | θ) = \sum_h p^*(v, h | θ), which yields a simple analytical form in the case of RBM or GRBM. However, explicit integration over the hidden neurons is not tractable for GDBM and therefore one has to employ some approximation. We use the approximation

\sum_h \exp( -E(v, h | θ) ) ≈ \exp( -E(v, μ | θ) ),

where μ is the mean-field approximation of the hidden activations, computed as discussed in Section III-A.
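A rough sketch of how the identity (3) can drive the adaptation: try a few candidate rates, estimate the resulting log-likelihood on a minibatch using model samples for the partition-function ratio, and keep the best one. Everything here (the candidate set, the dictionary-of-arrays parameter layout, the log_pstar callable returning an approximate unnormalized log-marginal) is an assumption for illustration, not the exact procedure of [9].

```python
import numpy as np

def pick_learning_rate(theta, grad, candidates, log_pstar, v_data, v_model):
    """Choose the learning rate maximizing the likelihood estimate of Eq. (3).

    theta, grad : dicts of parameters and of the corresponding gradients
    candidates  : candidate learning rates, e.g. (0.9 * eta, eta, 1.1 * eta)
    log_pstar   : callable (v, params) -> log p*(v | params); for a GDBM this is itself
                  approximated with the mean-field energy, as described in the text
    v_data      : minibatch used to score the candidates
    v_model     : samples from p(v | theta) used to estimate Z(theta_eta) / Z(theta)
    """
    base = np.array([log_pstar(v, theta) for v in v_model])
    best_eta, best_score = None, -np.inf
    for eta in candidates:
        theta_eta = {k: theta[k] + eta * grad[k] for k in theta}   # one gradient-ascent step
        # log < p*(v|theta_eta) / p*(v|theta) >_{p(v|theta)} via a log-sum-exp average
        ratios = np.array([log_pstar(v, theta_eta) for v in v_model]) - base
        log_z_ratio = np.logaddexp.reduce(ratios) - np.log(len(ratios))
        score = np.mean([log_pstar(v, theta_eta) for v in v_data]) - log_z_ratio
        if score > best_score:
            best_eta, best_score = eta, score
    return best_eta
```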

Fig. 1. The left plot shows the evolution of the inverse temperatures during learning while β_1 is fixed to 0. The middle and right plots show the number of swaps between each pair of consecutive chains at each update (swap) when the temperatures were adapted (middle) and when the temperatures were fixed at equally spaced levels (right).

Note that we use different portions of the data (mini-batches) for computing the stochastic gradient and for the adaptation of the learning rate, as was suggested in [9]. Therefore, the fixed-point iterations required for computing the mean-field approximation have to be run twice. However, initializing the mean-field values with samples from the model distribution seems to yield fast convergence. Also, making only a few fixed-point iterations (without convergence) seems to be enough to get stable behavior of the learning rate adaptation.

C. Layer-wise pretraining

Layer-wise pretraining is commonly used in deep networks to help obtain better models by initializing the weights sensibly [23]. DBM requires special care during the pretraining phase because the neurons in the intermediate layers receive signal from both the upper and the lower layers, unlike in deep belief networks [2]. Salakhutdinov proposed to cope with this problem by halving the pretrained weights in the intermediate layers and duplicating the visible and topmost layers during the pretraining [4]. The pretrained GRBM containing the visible layer has the following energy:

E(v, h^{(1)} | θ) = \sum_i \frac{(v_i - b_i)^2}{2 σ_i^2 / k_v} - k_v \sum_i \sum_j \frac{v_i}{σ_i^2} w_{ij} h_j^{(1)} - \sum_j c_j h_j^{(1)},

where k_v = 2 corresponds to duplicating the visible layer. Similarly, the topmost RBM during pretraining has the energy

E(h^{(L-1)}, h^{(L)} | θ) = - \sum_j b_j h_j^{(L-1)} - \sum_k (k_h c_k) h_k^{(L)} - \sum_j \sum_k h_j^{(L-1)} (k_h w_{jk}^{(L-1)}) h_k^{(L)},

where we also use k_h = 2.
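When the pretrained RBMs are composed into a single GDBM, the duplication above is compensated by halving the weights of the intermediate layers, as described in the text. The helper below sketches that assembly step under our reading of it; the function name, the list layout and the exact halving convention are our own assumptions.

```python
def assemble_gdbm_weights(pretrained_Ws):
    """Combine a stack of pretrained RBM weight matrices into GDBM weights.

    The bottom RBM is assumed to have been trained with the visible layer duplicated
    (k_v = 2) and the top RBM with the topmost layer duplicated (k_h = 2), so their
    weights are kept as they are; the intermediate weight matrices are halved so that
    each intermediate neuron receives roughly the same total input in the joint model
    as it did during pretraining.
    """
    Ws = [W.copy() for W in pretrained_Ws]
    for l in range(1, len(Ws) - 1):        # halve only the intermediate layers
        Ws[l] = 0.5 * Ws[l]
    return Ws
```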
V. EXPERIMENTS

We train the GDBM model on the Olivetti face dataset [24], which contains 400 face images of 40 people. We used the faces of one subset of the people for training and the faces of the remaining people as test samples. We test the model on the problem of reconstructing the left half of a face from its right half, and we compare the reconstruction results using the methodology presented in [11]. Note that the test set does not contain faces of people from the training set, for a better assessment of the generalization skills.

The dataset was initially normalized such that each pixel has zero mean and unit variance. We trained GDBM models with three hidden layers, where each hidden layer had 100 neurons. Unless specified otherwise, GDBM is trained using pretraining, the enhanced gradient and the adaptive learning rate. The size of a mini-batch and the number of samples from the model distribution were both set to 64. The initial learning rate and its upper bound were both set to 0.001 for pretraining, with a separate setting for the joint training of all the layers. A weight decay of 0.001 was used both during training GRBM and pretraining DBM. The reconstruction was performed by fixing the known half of the image and computing the mean-field approximation of the posterior distribution over all the unknown neurons, including the missing half of the image.

Adaptive Learning Rate. Our experiments confirmed that the learning rate was able to adapt automatically using the proposed strategy. The upper plot in Figure 3 shows that the algorithm very quickly finds the appropriate region of learning rates. After that, the learning rate slowly decreases, which is a desired property in stochastic optimization (see, e.g., [26]).

Parallel Tempering. Parallel tempering yields good results (see Fig. 4), while we were not able to achieve convergence of GDBM with PCD. Fig. 1 indicates that the proposed scheme of temperature adaptation results in an increased and consistent number of swaps among all consecutive pairs of tempered chains, and hence in better mixing.

Enhanced Gradient and Pretraining. One obvious way to check whether the higher layers of GDBM have learned any useful structure is to inspect the mean-field approximation values of the neurons in those layers given the training or test data samples. When no useful structure has been learned, most hidden neurons in the top layer converge near 0.5, which means that nearly no signal was received from the neighboring layers. On the other hand, when those neurons actually affect the modeling of the distribution, they converge to values close to either 0 or 1.

Fig. 2. The figures visualize the mean-field values of the hidden neurons at the different hidden layers (panels: hidden neurons at layers 1, 2 and 3) when the visible neurons are fixed to the test data samples. The top figures were obtained using the GDBM trained without any pretraining. The middle and bottom figures were obtained from the GDBMs trained with both the pretraining and the joint training, using either the traditional gradient or the enhanced gradient, respectively. All three GDBMs were trained for the same number of epochs.

One GDBM was trained without the layer-wise pretraining, starting from a random initialization. In this case it was clear that, except for the first hidden layer, no upper layer was able to learn any useful structure, as shown on the top row of Figure 2. Most of the approximated values of the hidden neurons in the second and third layers are near 0.5.

We trained the second GDBM by first initializing it with layer-wise pretraining. However, this time we did not use the enhanced gradient for either the pretraining or the joint training. Comparing with the top row of Figure 2, it is apparent from the figures of the middle row that the hidden neurons in the upper layers are approximated closer to 0 or 1 when the layers were pretrained. It suggests that the pretraining is important in the sense that it enables the upper layers to learn the structure of the data distribution. However, we were able to observe that many hidden neurons in the higher layers (see the hidden layer 2, for instance) do not contribute much to the modeling of the distribution, as they are either always inactive (= 0) or always active (= 1). This behavior was already discovered in the case of RBM, and it was shown that the enhanced gradient can resolve the problem [9]. Thus, we trained yet another GDBM, now using the enhanced gradient. The bottom row of Figure 2 clearly indicates that the enhanced gradient was able to address the problem. Now the hidden neurons in layer 3 respond differently to the distinct test samples and, by doing so, encourage the flow of uncertainty between the lower hidden layers and hidden layer 3, enabling the hidden neurons in layer 3 to capture the structure as well.

Comparison with other models. We trained principal component analysis (PCA) and GRBM on the same dataset. PCA used a small number of principal components, and GRBM had 100 hidden neurons. We limited the number of principal components because including more components resulted in stronger overfitting and larger reconstruction errors. The lower plot in Fig. 3 shows the evolution of the difference between the original test faces and the reconstructed faces for each model.

Fig. 4. Reconstructed test samples (original images, PCA reconstruction, GRBM reconstruction and GDBM reconstruction) using PCA, GRBM and GDBM.

Fig. 3. The evolution of the learning rate when the adaptive learning rate was used (left) and the reconstruction errors (RMSE) using three different methods: PCA, GRBM and GDBM (right).

It indicates that GRBM starts overfitting quickly, which results in an increase of the reconstruction error on the unseen faces. This is also evident from the reconstructed faces in Fig. 4. Ultimately, GRBM performs worse than PCA, as evidenced by both the RMSE and the visual inspection of the reconstructed faces. GDBM trained only with pretraining, using the identical GRBM training procedure, however, does not become overfitted and performs better than GRBM, which indicates that the additional hidden layers help it perform and generalize better. Furthermore, the joint training of GDBM by the proposed learning algorithm improves the performance significantly. It suggests that it is indeed important to jointly train all layers of GDBM in order to obtain a better generative model.

VI. DISCUSSION

In this paper, we described the Gaussian-Bernoulli deep Boltzmann machine and discussed its universal approximator property. Based on the learning algorithm for the binary DBM [1], we adapted three improvements, namely parallel tempering, the enhanced gradient and the adaptive learning rate, for training a GDBM. Through the experiments we were able to empirically show that a GDBM trained using these improvements can achieve good generative performance.

Although the results are not presented in this paper, using the same hyper-parameters as for learning faces we were able to train GDBM on other data sets such as NORB [27] and CIFAR-10 [7]. This clearly indicates that the proposed improvements make learning insensitive to the choice of the learning hyper-parameters and thus easier. However, the trained GDBMs were not able to produce any state-of-the-art classification accuracy. The discrepancy between the generative capability and the classification performance of GDBM is left for future research.

Recently, several novel approaches have been proposed for efficiently training DBM. Some of them are adaptive MCMC sampling [5], tempered transitions [14], using a separate set of recognition weights [6], the centering trick [28], the two-stage pretraining algorithm [29] and the metric-free natural gradient method [30]. It will be interesting to see how they perform

when they are used for training GDBMs, compared to the learning algorithm proposed in this paper.

REFERENCES

[1] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2009), 2009.
[2] G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006.
[3] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[4] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, "One shot learning of simple visual concepts," in Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, 2011.
[5] R. Salakhutdinov, "Learning deep Boltzmann machines using adaptive MCMC," in Proceedings of the 27th International Conference on Machine Learning (ICML 2010), J. Fürnkranz and T. Joachims, Eds. Haifa, Israel: Omnipress, June 2010.
[6] R. Salakhutdinov and H. Larochelle, "Efficient learning of deep Boltzmann machines," in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.
[7] A. Krizhevsky, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., 2009.
[8] K. Cho, A. Ilin, and T. Raiko, "Improved learning of Gaussian-Bernoulli restricted Boltzmann machines," in Proceedings of the International Conference on Artificial Neural Networks (ICANN 2011), 2011.
[9] K. Cho, T. Raiko, and A. Ilin, "Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines," in Proceedings of the 28th International Conference on Machine Learning (ICML 2011). New York, NY, USA: ACM, June 2011.
[10] ——, "Gaussian-Bernoulli deep Boltzmann machine," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, December 2011.
[11] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012.
[12] K. Cho, "Boltzmann machines and denoising autoencoders for image denoising," arXiv e-prints, Jan. 2013.
[13] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. New York, London, Sydney: John Wiley & Sons, 1985.
[14] R. Salakhutdinov, "Learning in Markov random fields using tempered transitions," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009.
[15] T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," in Proceedings of the 25th International Conference on Machine Learning (ICML 2008). New York, NY, USA: ACM, 2008.
[16] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Delalleau, "Parallel tempering for training of restricted Boltzmann machines," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. JMLR Workshop and Conference Proceedings, Y.-W. Teh and M. Titterington, Eds. JMLR W&CP, 2010.
[17] K. Cho, "Improved learning algorithms for restricted Boltzmann machines," Master's thesis, Aalto University School of Science, 2011.
[18] K. Cho, T. Raiko, and A. Ilin, "Parallel tempering is efficient for learning restricted Boltzmann machines," in Neural Networks (IJCNN), The 2011 International Joint Conference on, 2011.
[19] G. Desjardins, A. Courville, and Y. Bengio, "Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[20] T. Raiko, K. Cho, and A. Ilin, "Enhanced gradient for learning Boltzmann machines (abstract)," in The Learning Workshop, Fort Lauderdale, Florida, April 2011.
[21] A. Fischer and C. Igel, "Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines," in Proceedings of the International Conference on Artificial Neural Networks (ICANN 2011): Part III. Berlin, Heidelberg: Springer-Verlag, 2011.
[22] H. Schulz, A. Müller, and S. Behnke, "Investigating convergence of restricted Boltzmann machine learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[23] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, pp. 625-660, Mar. 2010.
[24] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the Second IEEE Workshop on Applications of Computer Vision, Dec. 1994.
[25] H. Poon and P. Domingos, "Sum-product networks: A new deep architecture," in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.
[26] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.
[27] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004.
[28] G. Montavon and K.-R. Müller, "Deep Boltzmann machines and the centering trick," in Neural Networks: Tricks of the Trade, Reloaded, 2nd ed., ser. LNCS, G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Springer, 2012.
[29] K. Cho, T. Raiko, A. Ilin, and J. Karhunen, "A two-stage pretraining algorithm for deep Boltzmann machines," in NIPS 2012 Workshop on Deep Learning and Unsupervised Feature Learning, Lake Tahoe, December 2012.
[30] G. Desjardins, R. Pascanu, A. Courville, and Y. Bengio, "Metric-free natural gradient for joint-training of Boltzmann machines," in Proceedings of the First International Conference on Learning Representations (ICLR 2013), 2013.
