A Better Way to Pretrain Deep Boltzmann Machines

Ruslan Salakhutdinov
Department of Statistics and Computer Science
University of Toronto
rsalakhu@cs.toronto.edu

Geoffrey Hinton
Department of Computer Science
University of Toronto
hinton@cs.toronto.edu

Abstract

We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks, and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modeling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.

1 Introduction

A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov Random Field with multiple layers of hidden random variables. Maximum likelihood learning in DBMs, and other related models, is very difficult because of the hard inference problem induced by the partition function [13, 11, 12, 6]. Multiple layers of hidden units make learning in DBMs far more difficult [3]. Learning meaningful DBM models, particularly when modeling high-dimensional data, relies on the heuristic greedy pretraining procedure introduced by [7], which is based on learning a stack of modified Restricted Boltzmann Machines (RBMs). Unfortunately, unlike the pretraining algorithm for Deep Belief Networks (DBNs), the existing procedure lacks a proof that adding additional layers improves the variational bound on the log-probability that the model assigns to the training data.

In this paper, we first show that under certain conditions, the pretraining algorithm improves a variational lower bound of a two-layer DBM. This result gives a much deeper understanding of the relationship between the pretraining algorithms for Deep Boltzmann Machines and Deep Belief Networks. Using this understanding, we introduce a new pretraining procedure for DBMs and show that it allows us to learn better generative models of handwritten digits and 3D objects.

2 Deep Boltzmann Machines (DBMs)

A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v ∈ {0,1}^D and a series of layers of hidden units h^(1) ∈ {0,1}^{F_1}, h^(2) ∈ {0,1}^{F_2}, ..., h^(L) ∈ {0,1}^{F_L}. There are connections only between units in adjacent layers. Consider a DBM with three hidden layers, as shown in Fig. 1, left panel. The probability that the DBM assigns to a visible vector v is:

P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp\Big( \sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j + \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l + \sum_{lm} W^{(3)}_{lm} h^{(2)}_l h^{(3)}_m \Big),    (1)

where h = {h^(1), h^(2), h^(3)} is the set of hidden units and θ = {W^(1), W^(2), W^(3)} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms (we omit the bias terms for clarity of presentation). Setting W^(2) = 0 and W^(3) = 0 recovers the Restricted Boltzmann Machine (RBM) model.
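
To make Eq. (1) concrete, the short sketch below evaluates P(v; θ) by brute force for a DBM that is small enough to enumerate exactly. The layer sizes, random weights, and the chosen visible vector are illustrative assumptions, not anything from the paper; for realistic layer sizes the same enumeration is exponential in the number of units, which is why the approximate procedures discussed next are needed.

import itertools
import numpy as np

# Toy sizes and random weights -- illustrative assumptions only.
rng = np.random.default_rng(0)
D, F1, F2, F3 = 4, 3, 3, 2
W1 = 0.1 * rng.standard_normal((D, F1))
W2 = 0.1 * rng.standard_normal((F1, F2))
W3 = 0.1 * rng.standard_normal((F2, F3))

def all_states(n):
    """Every binary vector of length n (2**n of them)."""
    return [np.array(s) for s in itertools.product([0, 1], repeat=n)]

def exponent(v, h1, h2, h3):
    # The exponent of Eq. (1); bias terms are omitted, as in the paper.
    return v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3

# Partition function Z(theta): sum over every joint configuration.
Z = sum(np.exp(exponent(v, h1, h2, h3))
        for v in all_states(D) for h1 in all_states(F1)
        for h2 in all_states(F2) for h3 in all_states(F3))

# P(v; theta): marginalize the hidden layers for one visible vector.
v = np.array([1, 0, 1, 0])
p_v = sum(np.exp(exponent(v, h1, h2, h3))
          for h1 in all_states(F1) for h2 in all_states(F2)
          for h3 in all_states(F3)) / Z
print(f"P(v) = {p_v:.5f}")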

Figure 1: Left: Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM). The top two layers of a DBN form an undirected graph and the remaining layers form a belief net with directed, top-down connections. For a DBM, all the connections are undirected. Right: Pretraining a DBM with three hidden layers consists of learning a stack of RBMs that are then composed to create a DBM. The first and last RBMs in the stack need to be modified by using asymmetric weights.

Approximate Learning: Exact maximum likelihood learning in this model is intractable, but efficient approximate learning of DBMs can be carried out by using mean-field inference to estimate data-dependent expectations, and an MCMC-based stochastic approximation procedure to approximate the model's expected sufficient statistics [7]. In particular, consider approximating the true posterior P(h | v; θ) with a fully factorized approximating distribution over the three sets of hidden units: Q(h | v; μ) = \prod_{j=1}^{F_1} q(h^(1)_j | v) \prod_{l=1}^{F_2} q(h^(2)_l | v) \prod_{k=1}^{F_3} q(h^(3)_k | v), where μ = {μ^(1), μ^(2), μ^(3)} are the mean-field parameters with q(h^(l)_i = 1) = μ^(l)_i for l = 1, 2, 3. In this case, we can write down the variational lower bound on the log-probability of the data, which takes a particularly simple form:

\log P(v; \theta) \ge v^\top W^{(1)} \mu^{(1)} + \mu^{(1)\top} W^{(2)} \mu^{(2)} + \mu^{(2)\top} W^{(3)} \mu^{(3)} - \log Z(\theta) + H(Q),    (2)

where H(·) is the entropy functional. Learning proceeds by finding the value of μ that maximizes this lower bound for the current value of the model parameters θ, which results in a set of mean-field fixed-point equations. Given the variational parameters μ, the model parameters θ are then updated to maximize the variational bound using stochastic approximation (for details see [7, 11, 14, 15]).
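
The mean-field fixed-point equations implied by Eq. (2) are not written out above, but for this model they amount to cyclically updating each μ^(l) given the layers next to it. The sketch below is a minimal NumPy version under the paper's simplifications (no bias terms); the bottom-up initialization, the fixed iteration count, and the absence of damping are assumptions of this sketch rather than the authors' exact recipe.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, W3, n_iters=30):
    """Fully factorized mean-field inference for a 3-hidden-layer DBM.

    Returns mu = (mu1, mu2, mu3), the Bernoulli parameters of Q(h | v) that
    (locally) maximize the bound in Eq. (2) for fixed weights. Biases are
    omitted to match the paper's presentation; n_iters is an arbitrary cap --
    in practice one iterates until the mu's stop changing.
    """
    mu1 = sigmoid(v @ W1)                    # bottom-up initialization
    mu2 = sigmoid(mu1 @ W2)
    mu3 = sigmoid(mu2 @ W3)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)   # each layer sees both neighbours
        mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)
        mu3 = sigmoid(mu2 @ W3)
    return mu1, mu2, mu3

In the full learning loop, these μ's supply the data-dependent statistics, while the model's own expected statistics come from the MCMC-based stochastic approximation procedure mentioned above.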

3 Pretraining Deep Boltzmann Machines

The above learning procedure works quite poorly when applied to DBMs that start with randomly initialized weights. Hidden units in higher layers are very under-constrained, so there is no consistent learning signal for their weights. To alleviate this problem, [7] introduced a layer-wise pretraining algorithm based on learning a stack of modified Restricted Boltzmann Machines (RBMs).

The idea behind the pretraining algorithm is straightforward. When learning the parameters of the first-layer RBM, the bottom-up weights are constrained to be twice the top-down weights (see Fig. 1, right panel). Intuitively, using twice the weights when inferring the states of the hidden units compensates for the initial lack of top-down feedback. Conversely, when pretraining the last RBM in the stack, the top-down weights are constrained to be twice the bottom-up weights. For all the intermediate RBMs the weights are halved in both directions when composing them to form a DBM, as shown in Fig. 1, right panel.

This heuristic pretraining algorithm works surprisingly well in practice. However, it is solely motivated by the need to end up with a model that has symmetric weights, and does not provide any useful insights into what is happening during the pretraining stage. Furthermore, unlike the pretraining algorithm for Deep Belief Networks (DBNs), it lacks a proof that each time a layer is added to the DBM, the variational bound improves.

3.1 Pretraining Algorithm for Deep Belief Networks

We first briefly review the pretraining algorithm for Deep Belief Networks [2], which will form the basis for developing a new pretraining algorithm for Deep Boltzmann Machines. Consider pretraining a two-layer DBN using a stack of RBMs. After learning the first RBM in the stack, we can write the generative model as p(v; W^(1)) = \sum_{h^{(1)}} p(v | h^(1); W^(1)) p(h^(1); W^(1)). The second RBM in the stack attempts to replace the prior p(h^(1); W^(1)) by a better model p(h^(1); W^(2)) = \sum_{h^{(2)}} p(h^(1), h^(2); W^(2)), thus improving the fit to the training data.

More formally, for any approximating distribution Q(h^(1) | v), the DBN's log-likelihood has the following variational lower bound on the log probability of the training data {v_1, ..., v_N}:

\sum_{n=1}^{N} \log P(v_n) \ge \sum_{n} E_{Q(h^{(1)}|v_n)}\big[ \log P(v_n | h^{(1)}; W^{(1)}) \big] - \sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(2)}) \big).

We set Q(h^(1) | v_n; W^(1)) = P(h^(1) | v_n; W^(1)), which is the true factorial posterior of the first-layer RBM. Initially, when W^(2) = W^(1)⊤, Q(h^(1) | v_n) defines the DBN's true posterior over h^(1), and the bound is tight. Maximizing the bound with respect to W^(2) only affects the last KL term in the above equation, and amounts to maximizing:

\frac{1}{N} \sum_{n=1}^{N} \sum_{h^{(1)}} Q(h^{(1)}|v_n; W^{(1)}) \log P(h^{(1)}; W^{(2)}).    (3)

This is equivalent to training the second-layer RBM with vectors h^(1) drawn from Q(h^(1) | v; W^(1)) as data. Hence, the second RBM in the stack learns a better model of the mixture over all N training cases, \frac{1}{N} \sum_n Q(h^{(1)}|v_n; W^{(1)}), called the aggregated posterior. This scheme can be extended to training higher-layer RBMs.

Observe that during the pretraining stage the whole prior of the lower-layer RBM is replaced by the next RBM in the stack. This leads to the hybrid Deep Belief Network model, with the top two layers forming a Restricted Boltzmann Machine and the lower layers forming a directed sigmoid belief network (see Fig. 1, left panel).

3.2 A Variational Bound for Pretraining a Two-layer Deep Boltzmann Machine

Consider a simple two-layer DBM with tied weights W^(2) = W^(1)⊤, as shown in Fig. 2a:

P(v; W^{(1)}) = \frac{1}{Z(W^{(1)})} \sum_{h^{(1)}, h^{(2)}} \exp\big( v^\top W^{(1)} h^{(1)} + h^{(1)\top} W^{(1)\top} h^{(2)} \big).    (4)

Similar to DBNs, for any approximate posterior Q(h^(1) | v), we can write a variational lower bound on the log probability that this DBM assigns to the training data:

\sum_{n=1}^{N} \log P(v_n) \ge \sum_{n} E_{Q(h^{(1)}|v_n)}\big[ \log P(v_n | h^{(1)}; W^{(1)}) \big] - \sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(1)}) \big).    (5)

The key insight is to note that the model's marginal distribution over h^(1) is the product of two identical distributions, one defined by an RBM composed of v and h^(1), and the other defined by an identical RBM composed of h^(1) and h^(2) [8]:

P(h^{(1)}; W^{(1)}) = \frac{1}{Z(W^{(1)})} \Big( \sum_{v} e^{v^\top W^{(1)} h^{(1)}} \Big) \Big( \sum_{h^{(2)}} e^{h^{(1)\top} W^{(1)\top} h^{(2)}} \Big),    (6)

where the first factor comes from the RBM over v and h^(1) and the second, identical factor comes from the RBM over h^(1) and h^(2).
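
Equation (6) is the crux of the argument: with tied weights, the DBM's marginal over h^(1) factors into two identical RBM terms. The brute-force check below (toy sizes and random weights are assumptions of the sketch) confirms numerically that the two sums in Eq. (6) are equal to each other, and to the closed-form product over units that each of them reduces to.

import itertools
import numpy as np

rng = np.random.default_rng(0)
D, F1 = 4, 3                     # toy sizes; h2 has D units because the weights are tied
W = 0.3 * rng.standard_normal((D, F1))
h1 = np.array([1, 0, 1])

def all_states(n):
    return [np.array(s) for s in itertools.product([0, 1], repeat=n)]

# First factor of Eq. (6): sum over visible configurations.
factor_v = sum(np.exp(v @ W @ h1) for v in all_states(D))
# Second factor: sum over top-layer configurations, with tied weights W2 = W.T.
factor_h2 = sum(np.exp(h1 @ W.T @ h2) for h2 in all_states(D))
# Closed form: both factors reduce to a product over D independent units.
closed_form = np.prod(1.0 + np.exp(W @ h1))

print(factor_v, factor_h2, closed_form)   # all three agree
assert np.allclose([factor_v, factor_h2], closed_form)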

Figure 2: Left: Pretraining a Deep Boltzmann Machine with two hidden layers. a) The DBM with tied weights. b) The second RBM with two sets of replicated hidden units, which will replace half of the 1st RBM's prior. c) The resulting DBM with modified second hidden layer. Right: The DBM with tied weights is trained to model the data using one-step contrastive divergence.

The idea is to keep one of these two RBMs and replace the other by the square root of a better prior P(h^(1); W^(2)). In particular, another RBM with two sets of replicated hidden units and tied weights, P(h^(1); W^(2)) = \sum_{h^{(2a)}, h^{(2b)}} P(h^(1), h^(2a), h^(2b); W^(2)), is trained to be a better model of the aggregated variational posterior \frac{1}{N} \sum_n Q(h^{(1)}|v_n; W^{(1)}) of the first model (see Fig. 2b). By initializing W^(2) = W^(1)⊤, the second-layer RBM has exactly the same prior over h^(1) as the original DBM. If the RBM is trained by maximizing the log likelihood objective:

\sum_{n} \sum_{h^{(1)}} Q(h^{(1)}|v_n) \log P(h^{(1)}; W^{(2)}),    (7)

then we obtain:

\sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(2)}) \big) \le \sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(1)}) \big).    (8)

Similar to Eq. 6, the distribution over h^(1) defined by the second-layer RBM is also the product of two identical distributions. Once the two RBMs are composed to form a two-layer DBM model (see Fig. 2c), the marginal distribution over h^(1) is the geometric mean of the two probability distributions P(h^(1); W^(1)) and P(h^(1); W^(2)) defined by the first and second-layer RBMs:

P(h^{(1)}; W^{(1)}, W^{(2)}) = \frac{1}{Z(W^{(1)}, W^{(2)})} \Big( \sum_{v} e^{v^\top W^{(1)} h^{(1)}} \Big) \Big( \sum_{h^{(2)}} e^{h^{(1)\top} W^{(2)} h^{(2)}} \Big).    (9)

Based on Eqs. 8, 9, it is easy to show that the variational lower bound of Eq. 5 improves because replacing half of the prior by a better model reduces the KL divergence from the variational posterior:

\sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(1)}, W^{(2)}) \big) \le \sum_{n} KL\big( Q(h^{(1)}|v_n) \,\|\, P(h^{(1)}; W^{(1)}) \big).    (10)

Due to the convexity of asymmetric divergence, this is guaranteed to improve the variational bound of the training data by at least half as much as fully replacing the original prior.

This result highlights a major difference between DBNs and DBMs. The procedure for adding an extra layer to a DBN replaces the full prior over the previous top layer, whereas the procedure for adding an extra layer to a DBM only replaces half of the prior. So in a DBM, the weights of the bottom-level RBM perform much more of the work than in a DBN, where the weights are only used to define the last stage of the generative process P(v | h^(1); W^(1)). This result also suggests that adding layers to a DBM will give diminishing improvements in the variational bound, compared to adding layers to a DBN. This may explain why DBMs with three hidden layers typically perform worse than DBMs with two hidden layers [7, 8].

On the other hand, the disadvantage of the pretraining procedure for Deep Belief Networks is that the top-layer RBM is forced to do most of the modeling work. This may also explain the need to use a large number of hidden units in the top-layer RBM [2].

There is, however, a way to design a new pretraining algorithm that would spread the modeling work more equally across all layers, hence bypassing the shortcomings of the existing pretraining algorithms for DBNs and DBMs.
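
Before turning to that algorithm, here is a quick numerical check of the geometric-mean claim in Eq. (9): for every configuration of h^(1), the composed DBM's marginal equals the renormalized geometric mean of the two RBM marginals. The layer sizes and random weights are, again, assumptions of the sketch.

import itertools
import numpy as np

rng = np.random.default_rng(1)
F1, D, F2 = 3, 4, 5              # toy sizes: D visible units, F2 second-layer units
W1 = 0.3 * rng.standard_normal((D, F1))
W2 = 0.3 * rng.standard_normal((F1, F2))

H1 = [np.array(s) for s in itertools.product([0, 1], repeat=F1)]

# Per-h1 factors: f1 from the (v, h1) RBM, f2 from the (h1, h2) RBM.
f1 = np.array([np.prod(1.0 + np.exp(W1 @ h1)) for h1 in H1])
f2 = np.array([np.prod(1.0 + np.exp(h1 @ W2)) for h1 in H1])

p_first  = f1**2 / np.sum(f1**2)        # Eq. (6): first RBM's marginal over h1
p_second = f2**2 / np.sum(f2**2)        # second-layer RBM's marginal over h1
p_dbm    = f1 * f2 / np.sum(f1 * f2)    # Eq. (9): composed two-layer DBM

geo_mean = np.sqrt(p_first * p_second)
assert np.allclose(p_dbm, geo_mean / geo_mean.sum())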

Figure 3: Left: Pretraining a Deep Boltzmann Machine with two hidden layers. a) The DBM with tied weights. b) The second-layer RBM is trained to model 2/3 of the 1st RBM's prior. c) The resulting DBM with modified second hidden layer. Right: The corresponding practical implementation of the pretraining algorithm that uses asymmetric weights.

3.3 Controlling the Amount of Modeling Work done by Each Layer

Consider a slightly modified two-layer DBM with two groups of replicated 2nd-layer units, h^(2a) and h^(2b), and tied weights (see Fig. 3a). The model's marginal distribution over h^(1) is the product of three identical RBM distributions, defined by h^(1) and v, h^(1) and h^(2a), and h^(1) and h^(2b):

P(h^{(1)}; W^{(1)}) = \frac{1}{Z(W^{(1)})} \Big( \sum_{v} e^{v^\top W^{(1)} h^{(1)}} \Big) \Big( \sum_{h^{(2a)}} e^{h^{(1)\top} W^{(1)\top} h^{(2a)}} \Big) \Big( \sum_{h^{(2b)}} e^{h^{(1)\top} W^{(1)\top} h^{(2b)}} \Big).    (11)

During the pretraining stage, we keep one of these RBMs and replace the other two by a better prior P(h^(1); W^(2)). To do so, similar to Sec. 3.2, we train another RBM, but with three sets of hidden units and tied weights (see Fig. 3b). When we combine the two RBMs into a DBM, the marginal distribution over h^(1) is the geometric mean of three probability distributions: one defined by the first-layer RBM, and the remaining two defined by the second-layer RBM:

P(h^{(1)}; W^{(1)}, W^{(2)}) = \frac{1}{Z(W^{(1)}, W^{(2)})} \Big( \sum_{v} e^{v^\top W^{(1)} h^{(1)}} \Big) \Big( \sum_{h^{(2a)}} e^{h^{(1)\top} W^{(2)} h^{(2a)}} \Big) \Big( \sum_{h^{(2b)}} e^{h^{(1)\top} W^{(2)} h^{(2b)}} \Big).    (12)

In this DBM, 2/3 of the first RBM's prior over the first hidden layer has been replaced by the prior defined by the second-layer RBM. The variational bound on the training data is guaranteed to improve by at least 2/3 as much as fully replacing the original prior. Hence, in this slightly modified DBM model, the second layer performs 2/3 of the modeling work compared to the first layer. Clearly, controlling the number of replicated hidden groups allows us to easily control the amount of modeling work left to the higher layers in the stack.

3.4 Practical Implementation

So far, we have made the assumption that we start with a two-layer DBM with tied weights. We now specify how one would train this initial set of tied weights W^(1). Let us consider the original two-layer DBM in Fig. 2a with tied weights. If we knew the initial state vector h^(2), we could train this DBM using one-step contrastive divergence (CD) with mean-field reconstructions of both the visible states v and the top-layer states h^(2), as shown in Fig. 2, right panel. Instead, we simply set the initial state vector h^(2) to be equal to the data, v. Using mean-field reconstructions for v and h^(2), one-step CD is exactly equivalent to training a modified RBM with only one hidden layer but with bottom-up weights that are twice the top-down weights, as defined in the original pretraining algorithm (see Fig. 1, right panel). This way of training the simple DBM with tied weights is unlikely to maximize the likelihood objective, but in practice it produces surprisingly good models that reconstruct the training data well.

When learning the second RBM in the stack, instead of maintaining a set of replicated hidden groups, it will often be convenient to approximate CD learning by training a modified RBM with one hidden layer but with asymmetric bottom-up and top-down weights.
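
The equivalence just described, one-step CD on the tied-weight DBM being the same as CD on a modified RBM whose bottom-up weights are doubled, fits in a few lines; a hedged sketch is given below (bias terms, mini-batching details, and the use of probabilities rather than sampled hidden states are simplifications of this sketch, not the paper's exact recipe). The same pattern, with different weight multipliers, is reused for the asymmetric RBMs described next.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_tied_dbm_step(v, W, lr=0.05):
    """One-step CD update for the tied-weight two-layer DBM of Fig. 2a,
    implemented as the equivalent modified RBM: bottom-up weights 2W,
    top-down weights W (Sec. 3.4). Mean-field (real-valued) reconstructions
    are used throughout.

    v : (n_cases, D) batch of binary data;  W : (D, F1) weight matrix.
    """
    h_data = sigmoid(2.0 * v @ W)            # recognition uses doubled weights
    v_recon = sigmoid(h_data @ W.T)          # top-down reconstruction uses W
    h_recon = sigmoid(2.0 * v_recon @ W)
    grad = (v.T @ h_data - v_recon.T @ h_recon) / v.shape[0]
    return W + lr * grad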

For example, consider pretraining a two-layer DBM in which we would like to split the modeling work between the 1st and 2nd-layer RBMs as 1/3 and 2/3. In this case, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 3, right panel). The conditional distributions needed for CD learning take the form:

P(h^{(1)}_j = 1 | v) = \frac{1}{1 + \exp(-\sum_i 3 W^{(1)}_{ij} v_i)},        P(v_i = 1 | h^{(1)}) = \frac{1}{1 + \exp(-\sum_j W^{(1)}_{ij} h^{(1)}_j)}.

Conversely, for the second modified RBM in the stack, the top-down weights are constrained to be 3/2 times the bottom-up weights. The conditional distributions take the form:

P(h^{(2)}_l = 1 | h^{(1)}) = \frac{1}{1 + \exp(-\sum_j 2 W^{(2)}_{jl} h^{(1)}_j)},        P(h^{(1)}_j = 1 | h^{(2)}) = \frac{1}{1 + \exp(-\sum_l 3 W^{(2)}_{jl} h^{(2)}_l)}.

Note that this second-layer modified RBM simply approximates the proper RBM with three sets of replicated h^(2) groups. In practice, this simple approximation works well compared to training a proper RBM, and is much easier to implement. When combining the RBMs into a two-layer DBM, we end up with W^(1) and 2W^(2) in the first and second layers, each performing 1/3 and 2/3 of the modeling work respectively:

P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h^{(1)}, h^{(2)}} \exp\big( v^\top W^{(1)} h^{(1)} + h^{(1)\top} 2 W^{(2)} h^{(2)} \big).    (13)

Parameters of the entire model can be generatively fine-tuned using the combination of the mean-field algorithm and the stochastic approximation algorithm described in Sec. 2.

4 Pretraining a Three-Layer Deep Boltzmann Machine

In the previous section, we showed that provided we start with a two-layer DBM with tied weights, we can train the second-layer RBM in a way that is guaranteed to improve the variational bound. For a DBM with more than two layers, we have not been able to develop a pretraining algorithm that is guaranteed to improve a variational bound. However, the results of Sec. 3 suggest that using simple modifications when pretraining a stack of RBMs allows us to approximately control the amount of modeling work done by each layer.

Figure 4: Layer-wise pretraining of a 3-layer Deep Boltzmann Machine.

Consider learning a 3-layer DBM in which each layer is forced to perform approximately 1/3 of the modeling work. This can easily be accomplished by learning a stack of three modified RBMs. Similar to the two-layer model, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 4). Two-thirds of this RBM's prior will be modeled by the 2nd and 3rd-layer RBMs. For the second modified RBM in the stack, we use 4W^(2) bottom-up and 3W^(2) top-down. Note that we are using 4W^(2) bottom-up, as we are expecting to replace half of the second RBM's prior by a third RBM, hence splitting the remaining 2/3 of the work equally between the top two layers. If we were to pretrain only a two-layer DBM, we would use 2W^(2) bottom-up and 3W^(2) top-down, as discussed in Sec. 3.2. For the last RBM in the stack, we use 2W^(3) bottom-up and 4W^(3) top-down. When combining the three RBMs into a three-layer DBM, we end up with symmetric weights W^(1), 2W^(2), and 2W^(3) in the first, second, and third layers, with each layer performing 1/3 of the modeling work:

P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp\big( v^\top W^{(1)} h^{(1)} + h^{(1)\top} 2 W^{(2)} h^{(2)} + h^{(2)\top} 2 W^{(3)} h^{(3)} \big).    (14)

Algorithm 1: Greedy Pretraining Algorithm for a 3-layer Deep Boltzmann Machine
1: Train the 1st-layer RBM using one-step CD learning with mean-field reconstructions of the visible vectors v. Constrain the bottom-up weights, 3W^(1), to be three times the top-down weights, W^(1).
2: Freeze 3W^(1), which defines the 1st layer of features, and use samples h^(1) from P(h^(1) | v; 3W^(1)) as the data for training the second RBM.
3: Train the 2nd-layer RBM using one-step CD learning with mean-field reconstructions of its visible vectors. Set the bottom-up weights to 4W^(2) and the top-down weights to 3W^(2).
4: Freeze 4W^(2), which defines the 2nd layer of features, and use samples h^(2) from P(h^(2) | h^(1); 4W^(2)) as the data for training the next RBM.
5: Train the 3rd-layer RBM using one-step CD learning with mean-field reconstructions of its visible vectors. During the learning, set the bottom-up weights to 2W^(3) and the top-down weights to 4W^(3).
6: Use the weights {W^(1), 2W^(2), 2W^(3)} to compose a three-layer Deep Boltzmann Machine.

The new pretraining procedure for a 3-layer DBM is shown in Alg. 1. Note that compared to the original algorithm, it requires almost no extra work and can be easily integrated into existing code. Extensions to training DBMs with more layers are trivial. As we show in our experimental results, this pretraining can improve the generative performance of Deep Boltzmann Machines.
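
Algorithm 1 translates almost line for line into code. The sketch below is a minimal NumPy rendering under simplifying assumptions: no bias terms, mean-field CD-1 as in Sec. 3.4, a fixed learning rate, full-batch updates, and probabilities used in place of samples when passing data up the stack. It is meant to make the weight asymmetries and the final composition {W^(1), 2W^(2), 2W^(3)} explicit, not to reproduce the authors' training code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_train(data, n_hidden, up, down, n_epochs=100, lr=0.05, seed=0):
    """Train one modified RBM with one-step CD and asymmetric weights:
    recognition uses up*W, reconstruction uses down*W (cf. Sec. 3.4)."""
    rng = np.random.default_rng(seed)
    n_cases, n_vis = data.shape
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    for _ in range(n_epochs):
        h_data = sigmoid(up * data @ W)
        v_recon = sigmoid(down * h_data @ W.T)      # mean-field reconstruction
        h_recon = sigmoid(up * v_recon @ W)
        W += lr * (data.T @ h_data - v_recon.T @ h_recon) / n_cases
    return W

def pretrain_three_layer_dbm(v_data, sizes=(500, 500, 1000)):
    """Greedy pretraining of a 3-layer DBM following Algorithm 1.
    Layer sizes default to the MNIST architecture used in the experiments."""
    F1, F2, F3 = sizes
    # Steps 1-2: first RBM, bottom-up 3*W1, top-down W1; pass P(h1 | v; 3*W1) upward.
    W1 = cd1_train(v_data, F1, up=3.0, down=1.0)
    h1_data = sigmoid(3.0 * v_data @ W1)
    # Steps 3-4: second RBM, bottom-up 4*W2, top-down 3*W2.
    W2 = cd1_train(h1_data, F2, up=4.0, down=3.0)
    h2_data = sigmoid(4.0 * h1_data @ W2)
    # Step 5: third RBM, bottom-up 2*W3, top-down 4*W3.
    W3 = cd1_train(h2_data, F3, up=2.0, down=4.0)
    # Step 6: compose the DBM with weights {W1, 2*W2, 2*W3}.
    return W1, 2.0 * W2, 2.0 * W3
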
5 Experimental Results

In our experiments we used the MNIST and NORB datasets. During greedy pretraining, each layer was trained for 100 epochs using one-step contrastive divergence. Generative fine-tuning of the full DBM model, using mean-field together with stochastic approximation, required 300 epochs. In order to estimate the variational lower bounds achieved by the different pretraining algorithms, we need to estimate the global normalization constant. Recently, [10] demonstrated that Annealed Importance Sampling (AIS) can be used to efficiently estimate the partition function of an RBM. We adopt AIS in our experiments as well. Together with variational inference this allows us to obtain good estimates of the lower bound on the log-probability of the training and test data.

5.1 MNIST

The MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28x28 pixels. In our first experiment, we considered a standard two-layer DBM with 500 and 1000 hidden units (these architectures have been considered before in [7, 9], which allows us to provide a direct comparison), and used two different algorithms for pretraining it. The first pretraining algorithm, which we call DBM-1/2-1/2, is the original algorithm for pretraining DBMs, as introduced by [7] (see Fig. 1). Here, the modeling work between the 1st and 2nd-layer RBMs is split equally. The second algorithm, DBM-1/3-2/3, uses the modified pretraining procedure of Sec. 3.4, so that the second RBM in the stack ends up doing 2/3 of the modeling work compared to the 1st-layer RBM. Results are shown in Table 1. Prior to the global generative fine-tuning, the estimate of the lower bound on the average test log-probability for DBM-1/3-2/3 was -108.65 per test case, compared to -114.32 achieved by the standard pretraining algorithm DBM-1/2-1/2. The large difference of about 7 nats shows that leaving more of the modeling work to the second layer, which has a larger number of hidden units, substantially improves the variational bound. After the global generative fine-tuning, DBM-1/3-2/3 achieves a lower bound of -83.43, which is better than the -84.62 achieved by DBM-1/2-1/2. This is also better than the lower bound of -85.97 achieved by a carefully trained two-hidden-layer Deep Belief Network [10].

In our second experiment, we pretrained a 3-layer Deep Boltzmann Machine with 500, 500, and 1000 hidden units. The existing pretraining algorithm, DBM-1/2-1/4-1/4, approximately splits the modeling work between the three RBMs in the stack as 1/2, 1/4, 1/4, so the weights in the 1st-layer RBM perform half of the work, compared to a quarter each for the higher-level RBMs. On the other hand, the new pretraining procedure (see Alg. 1), which we call DBM-1/3-1/3-1/3, splits the modeling work equally across all three layers.

Table 1: MNIST: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (500 and 1000 hidden units), and the other with three layers (500, 500, and 1000 hidden units). Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                        Pretraining           Generative Fine-Tuning
                        Train      Test       Train      Test
  2 layers
  DBM-1/2-1/2           -113.32    -114.32    -83.61     -84.62
  DBM-1/3-2/3           -107.89    -108.65    -82.83     -83.43
  3 layers
  DBM-1/2-1/4-1/4       -116.74    -117.38    -84.49     -85.10
  DBM-1/3-1/3-1/3       -107.12    -107.65    -82.34     -83.02

Table 2: NORB: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (1000 and 2000 hidden units), and the other with three layers (1000, 1000, and 2000 hidden units). Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                        Pretraining           Generative Fine-Tuning
                        Train      Test       Train      Test
  2 layers
  DBM-1/2-1/2           -640.94    -643.87    -598.13    -601.76
  DBM-1/3-2/3           -633.21    -636.65    -593.76    -597.23
  3 layers
  DBM-1/2-1/4-1/4       -641.87    -645.06    -598.98    -602.84
  DBM-1/3-1/3-1/3       -632.75    -635.14    -592.87    -596.11

Table 1 shows that DBM-1/3-1/3-1/3 achieves a lower bound on the average test log-probability of -107.65, improving upon DBM-1/2-1/4-1/4's bound of -117.38. The difference of about 10 nats further demonstrates that during the pretraining stage, it is rather crucial to push more of the modeling work to the higher layers. After generative fine-tuning, the bound on the test log-probabilities for DBM-1/3-1/3-1/3 was -83.02, so with the new pretraining procedure, the three-hidden-layer DBM performs slightly better than the two-hidden-layer DBM. With the original pretraining procedure, the 3-layer DBM achieves a bound of -85.10, which is worse than the bound of -84.62 achieved by the 2-layer DBM, as reported by [7, 9].

5.2 NORB

The NORB dataset [4] contains images of 50 different 3D toy objects with 10 objects in each of five generic classes: cars, trucks, planes, animals, and humans. Each object is photographed from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo image pairs of 25 objects, 5 per class, while the test set contains 24,300 stereo pairs of the remaining, different 25 objects. From the training data, 4,300 pairs were set aside for validation.

To deal with raw pixel data, we followed the approach of [5] by first learning a Gaussian-binary RBM with 4000 hidden units, and then treating the activities of its hidden layer as preprocessed binary data.

Similar to the MNIST experiments, we trained two Deep Boltzmann Machines: one with two layers (1000 and 2000 hidden units), and the other with three layers (1000, 1000, and 2000 hidden units). Table 2 reveals that for both DBMs, the new pretraining achieves much better variational bounds on the average test log-probability. Even after the global generative fine-tuning, Deep Boltzmann Machines pretrained using the new algorithm improve upon standard DBMs by at least 5 nats.

6 Conclusion

In this paper we provided a better understanding of how the pretraining algorithms for Deep Belief Networks and Deep Boltzmann Machines are related, and used this understanding to develop a different method of pretraining. Unlike many of the existing pretraining algorithms for DBNs and DBMs, the new procedure can distribute the modeling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn much better generative models.

Acknowledgments

This research was funded by NSERC, an Early Researcher Award, and gifts from Microsoft and Google. G.H. and R.S.
are fellows of the Canadian Institute for Advanced Research.

References

[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[2] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[3] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1-40, 2009.
[4] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97-104, 2004.
[5] V. Nair and G. E. Hinton. Implicit mixtures of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, volume 21, 2009.
[6] M. A. Ranzato. Unsupervised learning of feature hierarchies. Ph.D. thesis, New York University, 2009.
[7] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[8] R. R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for Deep Boltzmann Machines. Neural Computation, 24:1967-2006, 2012.
[9] R. R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 13, 2010.
[10] R. R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, pages 872-879, 2008.
[11] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[12] M. Welling and G. E. Hinton. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, 2415, 2002.
[13] M. Welling and C. Sutton. Learning in Markov random fields with contrastive free energies. In International Workshop on AI and Statistics (AISTATS 2005), 2005.
[14] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, March 17, 2000.
[15] A. L. Yuille. The convergence of contrastive divergences. In Advances in Neural Information Processing Systems, 2004.