
Representational Power of Restricted Boltzmann Machines and Deep Belief Networks

Nicolas Le Roux and Yoshua Bengio
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
http:// lisa

Abstract

Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton et al., along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them, and because they have been successfully used as building blocks for training deeper models. We first prove that adding hidden units yields strictly improved modelling power, while a second theorem shows that RBMs are universal approximators of discrete distributions. We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests a new and less greedy criterion for training RBMs within DBNs.

1 Introduction

Learning algorithms that learn to represent functions with many levels of composition are said to have a deep architecture. Bengio and Le Cun (2007) discuss results in the computational theory of circuits that strongly suggest that deep architectures are much more efficient in terms of representation (number of computational elements, number of parameters) than their shallow counterparts. In spite of the fact that 2-level architectures (e.g., a one-hidden-layer neural network, a kernel machine, or a 2-level digital circuit) are able to represent any function (see for example Hornik, Stinchcombe, & White, 1989), they may need a huge number of elements and, consequently, of training examples. For example, the parity function on d bits (which associates the value 1 with a vector if it has an odd number of bits equal to 1, and 0 otherwise) can be implemented by a digital circuit of depth log(d) with O(d) elements, but requires O(2^d) elements to be represented by a 2-level digital circuit (Ajtai, 1983) (e.g., in conjunctive or disjunctive normal form). We proved a similar result for Gaussian kernel machines: they require O(2^d) non-zero coefficients (i.e., support vectors in a Support Vector Machine) to represent such highly varying functions (Bengio, Delalleau, & Le Roux, 2006a).

On the other hand, training learning algorithms with a deep architecture (such as neural networks with many hidden layers) appears to be a challenging optimization problem (Tesauro, 1992; Bengio, Lamblin, Popovici, & Larochelle, 2007). Hinton, Osindero, and Teh (2006) introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN). The training strategy for such networks may hold great promise as a principle to help address the problem of training deep networks. Upper layers of a DBN are supposed to represent more abstract concepts that explain the input data, whereas lower layers extract low-level features from the data. In (Bengio et al., 2007; Ranzato, Poultney, Chopra, & LeCun, 2007), this greedy layer-wise principle is found to be applicable to models other than DBNs. DBNs and RBMs have already been applied successfully to a number of classification, dimensionality reduction, information retrieval, and modelling tasks (Welling, Rosen-Zvi, & Hinton, 2005; Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Bengio et al., 2007; Salakhutdinov & Hinton, 2007).

In this paper we show that adding hidden units yields strictly improved modelling power, unless the RBM already perfectly models the data. Then, we prove that an RBM can model any discrete distribution, a property similar to those of neural networks with one hidden layer. Finally, we discuss the representational power of DBNs and find a puzzling result about the best that could be achieved when going from 1-layer to 2-layer DBNs. Note that the proofs of universal approximation by RBMs are constructive, but these constructions are not practical as they would lead to RBMs with potentially as many hidden units as examples, and this would defy the purpose of using RBMs as building blocks of a deep network that efficiently represents the input distribution. Important theoretical questions therefore remain unanswered concerning the potential for DBNs that stack multiple RBMs to represent a distribution efficiently.

1.1 Background on RBMs

Definition and properties

A Restricted Boltzmann Machine (RBM) is a particular form of the Product of Experts model (Hinton, 1999, 2002) which is also a Boltzmann Machine (Ackley, Hinton, & Sejnowski, 1985) with a bipartite connectivity graph. An RBM with n hidden units is a parametric model of the joint distribution between hidden variables h_i (explanatory factors, collected in vector h) and observed variables v_j (the example, collected in vector v), of the form

p(v, h) ∝ exp(−E(v, h)) = e^(h^T W v + b^T v + c^T h)

with parameters θ = (W, b, c) and v_j, h_i ∈ {0, 1}. E(v, h) is called the energy of the state (v, h). We consider here the simpler case of binary units. It is straightforward to show that P(v|h) = ∏_j P(v_j|h) and P(v_j = 1|h) = sigm(b_j + Σ_i W_ij h_i), where sigm is the sigmoid function defined as sigm(x) = 1/(1 + exp(−x)), and that P(h|v) has a similar form: P(h|v) = ∏_i P(h_i|v) and P(h_i = 1|v) = sigm(c_i + Σ_j W_ij v_j). Although the marginal distribution p(v) is not tractable, it can be easily computed up to a normalizing constant.
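
To make the conditionals and the unnormalized marginal concrete, here is a small sketch (ours, in Python/NumPy; the sizes and random parameters are only illustrative and not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigm(x):
        """Logistic sigmoid, sigm(x) = 1 / (1 + exp(-x))."""
        return 1.0 / (1.0 + np.exp(-x))

    # A small binary RBM.  Energy: E(v, h) = -(h^T W v + b^T v + c^T h),
    # so that p(v, h) is proportional to exp(-E(v, h)).
    d, n = 4, 3                                  # illustrative sizes
    W = 0.1 * rng.standard_normal((n, d))        # W[i, j] connects h_i and v_j
    b = np.zeros(d)                              # visible biases
    c = np.zeros(n)                              # hidden biases

    def p_h_given_v(v):
        """Vector of P(h_i = 1 | v) = sigm(c_i + sum_j W_ij v_j)."""
        return sigm(c + W @ v)

    def p_v_given_h(h):
        """Vector of P(v_j = 1 | h) = sigm(b_j + sum_i W_ij h_i)."""
        return sigm(b + W.T @ h)

    def unnormalized_p_v(v):
        """p(v) up to the (intractable) normalizing constant, using
        sum_h exp(h^T W v + b^T v + c^T h) = exp(b^T v) * prod_i (1 + exp(c_i + (W v)_i))."""
        return np.exp(b @ v) * np.prod(1.0 + np.exp(c + W @ v))

    v = rng.integers(0, 2, size=d).astype(float)
    print("P(h = 1 | v):", p_h_given_v(v))
    print("P(v = 1 | h = 1):", p_v_given_h(np.ones(n)))
    print("p(v) up to a constant:", unnormalized_p_v(v))

Summing the joint over h factorizes across hidden units, which is why the last function can evaluate p(v) up to the normalizing constant without enumerating hidden configurations.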

Furthermore, one can also sample from the model distribution using Gibbs sampling. Consider a Monte-Carlo Markov chain (MCMC) initialized with v sampled from the empirical data distribution (distribution denoted p_0). After sampling h from P(h|v), sample v from P(v|h), which follows a distribution denoted p_1. After k such steps we have samples from p_k, and the model's generative distribution is p_∞ (due to convergence of the Gibbs MCMC).

Training and Contrastive Divergence

Carreira-Perpiñan and Hinton (2005) showed that the derivative of the log-likelihood of the data under the RBM with respect to the parameters is

∂ log p(v) / ∂θ = − ⟨ ∂E(v, h)/∂θ ⟩_0 + ⟨ ∂E(v, h)/∂θ ⟩_∞    (1)

where averaging is over both h and v, ⟨·⟩_0 denotes an average with respect to p_0 (the data distribution) multiplied by P(h|v), and ⟨·⟩_∞ denotes an average with respect to p_∞ (the model distribution: p_∞(v, h) = p(v, h)). Since computing the average over the true model distribution is intractable, Hinton et al. (2006) use an approximation of that derivative called contrastive divergence (Hinton, 1999, 2002): one replaces the average ⟨·⟩_∞ with ⟨·⟩_k for relatively small values of k. For example, in Hinton et al. (2006), Hinton and Salakhutdinov (2006), Bengio et al. (2007) and Salakhutdinov and Hinton (2007), one uses k = 1 with great success. The average over v's from p_0 is replaced by a sample from the empirical distribution (this is the usual stochastic gradient sampling trick), and the average over v's from p_1 is replaced by a single sample from the Markov chain. The resulting gradient estimator involves only very simple computations; for the case of binary units, the gradient estimator on weight W_ij is simply P(h_i = 1|v) v_j − P(h_i = 1|ṽ) ṽ_j, where ṽ is a sample from p_1 and v is the input example that starts the chain. The procedure can easily be generalized to input or hidden units that are not binary (e.g., Gaussian or exponential, for continuous-valued units) (Welling et al., 2005; Bengio et al., 2007).
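
In code, the CD-1 estimator just described amounts to one stochastic up-down pass followed by an outer-product update. A minimal sketch (ours; the learning rate and sizes are illustrative, and the bias updates are an assumption since the text only spells out the update for W):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    d, n = 6, 4                               # illustrative sizes
    W = 0.01 * rng.standard_normal((n, d))
    b, c = np.zeros(d), np.zeros(n)           # visible and hidden biases

    def cd1_step(v0, lr=0.1):
        """One contrastive-divergence (k = 1) update from a single training vector v0."""
        global W, b, c
        # Up: sample h0 from P(h | v0).
        ph0 = sigm(c + W @ v0)
        h0 = (rng.random(n) < ph0).astype(float)
        # Down: sample v1 from P(v | h0), then recompute the hidden probabilities.
        pv1 = sigm(b + W.T @ h0)
        v1 = (rng.random(d) < pv1).astype(float)
        ph1 = sigm(c + W @ v1)
        # Gradient estimator of the text for W_ij: P(h_i = 1 | v0) v0_j - P(h_i = 1 | v1) v1_j.
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        # Analogous (standard) bias updates (an assumption, not spelled out in the text).
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)

    # Usage: one sweep over a tiny random binary dataset.
    data = rng.integers(0, 2, size=(10, d)).astype(float)
    for v in data:
        cd1_step(v)
    print(W[:2, :3])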

2 RBMs are Universal Approximators

We will now prove that RBMs with a data-selected number of hidden units become non-parametric and possess universal approximation properties, relating them closely to classical multilayer neural networks, but in the context of probabilistic unsupervised learning of an input distribution.

2.1 Better Model with Increasing Number of Units

We show below that when the number of hidden units of an RBM is increased, there are weight values for the new units that guarantee improvement in the training log-likelihood or, equivalently, in the KL divergence between the data distribution p_0 and the model distribution p = p_∞. These are equivalent since

KL(p_0 ‖ p) = Σ_v p_0(v) log( p_0(v) / p(v) ) = −H(p_0) − (1/N) Σ_{i=1}^{N} log p(v^(i))

when p_0 is the empirical distribution, with v^(i) the i-th training vector and N the number of training vectors.

Consider the objective of approximating an arbitrary distribution p_0 with an RBM. Let p denote the distribution over visible units obtained with an RBM that has n hidden units, and p_{w,c} denote the input distribution obtained when adding a hidden unit with weights w and bias c to that RBM. The RBM with this extra unit has the same weights and biases for all other hidden units, and the same input biases.

Lemma 2.1. Let R_p be the equivalence class containing the RBMs whose associated marginal distribution over the visible units is p. The operation of adding a hidden unit to an RBM of R_p preserves the equivalence class. Thus, the set of RBMs composed of an RBM of R_p and an additional hidden unit is also an equivalence class (meaning that all the RBMs of this set have the same marginal distribution over visible units).

Proof in appendix.

R_p will be used here to denote any RBM in this class. We also define R_{p_{w,c}} as the set of RBMs obtained by adding a hidden unit with weight w and bias c to an RBM from R_p, and p_{w,c} the associated marginal distribution over the visible units. As demonstrated in the above lemma, this does not depend on which particular RBM from R_p we choose.

We then wish to prove that, regardless of p and p_0, if p ≠ p_0, there exists a pair (w, c) such that KL(p_0 ‖ p_{w,c}) < KL(p_0 ‖ p), i.e., that one can improve the approximation of p_0 by inserting an extra hidden unit with weight vector w and bias c. We will first state a trivial lemma needed for the rest of the proof. It says that inserting a unit with bias c = −∞ does not change the input distribution associated with the RBM.

Lemma 2.2. Let p be the distribution over binary vectors v in {0, 1}^d obtained with an RBM R_p, and let p_{w,c} be the distribution obtained when adding a hidden unit with weights w and bias c to R_p. Then

∀p, ∀w ∈ R^d,  p = p_{w,−∞}

Proof. Denoting h̃ = [h, h_{n+1}] (h extended with the new unit), W̃ = [W; w^T] (W extended with the row w^T) and C̃ = [C; c], where w^T denotes the transpose of w, and introducing z(v, h) = exp(h^T W v + B^T v + C^T h), we can express p(v, h) and p_{w,c}(v, h̃) as follows:

p(v, h) ∝ z(v, h)
p_{w,c}(v, h̃) ∝ exp(h̃^T W̃ v + B^T v + C̃^T h̃) = z(v, h) exp(h_{n+1} w^T v + c h_{n+1})

If c = −∞, p_{w,c}(v, h̃) = 0 whenever h_{n+1} = 1. Thus, we can discard all terms where h_{n+1} = 1, keeping only those where h_{n+1} = 0. Marginalizing over the hidden units, we have:

p(v) = Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0)

p_{w,−∞}(v) = Σ_{h̃} z(v, h) exp(h_{n+1} w^T v + c h_{n+1}) / Σ_{v^0, h̃^0} z(v^0, h^0) exp(h^0_{n+1} w^T v^0 + c h^0_{n+1})
            = Σ_h z(v, h) exp(0) / Σ_{v^0, h^0} z(v^0, h^0) exp(0) = p(v)

We now state the main theorem.

Theorem 2.3. Let p_0 be an arbitrary distribution over {0, 1}^n and let R_p be an RBM with marginal distribution p over the visible units such that KL(p_0 ‖ p) > 0. Then there exists an RBM R_{p_{w,c}}, composed of R_p and an additional hidden unit with parameters (w, c), whose marginal distribution p_{w,c} over the visible units achieves KL(p_0 ‖ p_{w,c}) < KL(p_0 ‖ p).

Proof in appendix.

2.2 A Huge Model can Represent Any Distribution

The second set of results is for the limit case when the number of hidden units is very large, so that we can represent any discrete distribution exactly.

Theorem 2.4. Any distribution over {0, 1}^n can be approximated arbitrarily well (in the sense of the KL divergence) with an RBM with k + 1 hidden units, where k is the number of input vectors whose probability is not 0.

Proof sketch (universal approximator property). We constructively build an RBM with as many hidden units as the number of input vectors whose probability is strictly positive. Each hidden unit will be assigned to one input vector. Namely, when v_i is the visible units vector, all hidden units have a probability 0 of being on except the one corresponding to v_i, which has a probability sigm(λ_i) of being on. The value of λ_i is directly tied to p(v_i). On the other hand, when all hidden units are off but the i-th one, p(v_i|h) = 1. With probability 1 − sigm(λ_i), all the hidden units are turned off, which yields independent draws of the visible units. The proof consists in finding the appropriate weights (and values λ_i) to yield that behaviour.

Proof in appendix.
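
Lemma 2.2 (and the tractability of p(v) up to a constant) can be checked numerically on a tiny RBM by brute-force enumeration. The sketch below (ours; a bias of −50 stands in for c = −∞, and the model sizes are arbitrary) verifies that the added unit leaves the marginal over the visible units unchanged:

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)

    def exact_marginal(W, b, c):
        """Exact p(v) of a binary RBM by enumerating every (v, h) configuration."""
        n, d = W.shape
        vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
        hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
        # z(v, h) = exp(h^T W v + b^T v + c^T h)
        z = np.exp(hs @ W @ vs.T + (vs @ b)[None, :] + (hs @ c)[:, None])
        p_v = z.sum(axis=0)
        return vs, p_v / p_v.sum()

    d, n = 3, 2
    W = rng.standard_normal((n, d))
    b = rng.standard_normal(d)
    c = rng.standard_normal(n)

    vs, p = exact_marginal(W, b, c)

    # Add a hidden unit with arbitrary weights w and a very negative bias (c -> -infinity).
    w = rng.standard_normal(d)
    W2 = np.vstack([W, w])
    c2 = np.append(c, -50.0)
    _, p_wc = exact_marginal(W2, b, c2)

    print("max |p - p_{w,c}| =", np.max(np.abs(p - p_wc)))   # ~0, as Lemma 2.2 states

The same enumeration routine is also a convenient way to inspect the constructions of Theorems 2.3 and 2.4 on small models.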

3 Representational power of Deep Belief Networks

3.1 Background on Deep Belief Networks

A DBN with l layers models the joint distribution between observed variables v_j and l hidden layers h^(k), k = 1, ..., l, made of binary units h_i^(k) (here all binary variables), as follows:

p(v, h^(1), h^(2), ..., h^(l)) = P(v|h^(1)) P(h^(1)|h^(2)) ... P(h^(l−2)|h^(l−1)) p(h^(l−1), h^(l))

Denoting v = h^(0), b^(k) the bias vector of layer k and W^(k) the weight matrix between layer k and layer k + 1, we have:

P(h^(k)|h^(k+1)) = ∏_i P(h_i^(k)|h^(k+1))   (factorial conditional distribution)
P(h_i^(k) = 1|h^(k+1)) = sigm( b_i^(k) + Σ_j W_ij^(k) h_j^(k+1) )    (2)

and p(h^(l−1), h^(l)) is an RBM.

The original motivation found in Hinton et al. (2006) for having a deep network versus a single hidden layer (i.e., a DBN versus an RBM) was that the representational power of an RBM would be too limited and that more capacity could be achieved by having more hidden layers. However, we have found here that an RBM with enough hidden units can model any discrete distribution. Another motivation for deep architectures is discussed in Bengio and Le Cun (2007) and Bengio et al. (2007): deep architectures can represent functions much more efficiently (in terms of number of required parameters) than shallow ones. In particular, theoretical results on circuit complexity theory prove that shallow digital circuits can be exponentially less efficient than deeper ones (Ajtai, 1983; Hastad, 1987; Allender, 1996). Hence the original motivation (Hinton et al., 2006) was probably right when one considers the restriction to reasonably sized models.

3.2 Trying to Anticipate a High-Capacity Top Layer

In the greedy training procedure of Deep Belief Networks proposed in (Hinton et al., 2006), one layer is added on top of the network at each stage, and only that top layer is trained (as an RBM; see Figure 1). In that greedy phase, one does not take into account the fact that other layers will be added next. Indeed, while trying to optimize the weights of an RBM, we restrict the marginal distribution over its hidden units to be the one induced by the RBM. On the contrary, when we add a new layer, that distribution (which is the marginal distribution over the visible units of the new RBM) does not have that restriction (but another one, which is to be representable by an RBM of a given size). Thus, we might be able to better optimize the weights of the RBM, knowing that the marginal distribution over the hidden units will have more freedom when extra layers are added. This would lead to an alternative training criterion for DBNs.
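
To sample from the DBN joint written above, one runs a few Gibbs steps in the top RBM p(h^(l−1), h^(l)) and then propagates a sample down through the factorial conditionals of eq. (2). A sketch for a 2-layer DBN (ours; the layer sizes, random weights and number of Gibbs steps are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
    bern = lambda p: (rng.random(p.shape) < p).astype(float)

    # A 2-layer DBN (l = 2): v = h^(0), hidden layers h^(1) and h^(2).
    sizes = [8, 6, 4]                       # units in h^(0), h^(1), h^(2)
    # W[k] is the weight matrix between layer k and layer k + 1, as in eq. (2).
    W = [0.1 * rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(2)]
    b = [np.zeros(s) for s in sizes]        # biases b^(0), b^(1), b^(2)

    def sample_dbn(gibbs_steps=50):
        """Ancestral sample: Gibbs in the top RBM p(h^(1), h^(2)), then one top-down pass."""
        h1 = bern(0.5 * np.ones(sizes[1]))
        for _ in range(gibbs_steps):        # approximate sample from the top RBM
            h2 = bern(sigm(b[2] + W[1].T @ h1))
            h1 = bern(sigm(b[1] + W[1] @ h2))
        # Factorial conditional of eq. (2): P(h_i^(0) = 1 | h^(1)).
        v = bern(sigm(b[0] + W[0] @ h1))
        return v

    print(sample_dbn())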

Figure 1: Greedy learning of an RBM. After each RBM has been trained, the weights are frozen and a new layer is added. The new layer is trained as an RBM. [Panels: (a) Stage 1, (b) Stage 2, (c) Stage 3.]

Consider a 2-layer DBN (l = 2), that is, with three layers in total. To train the weights between h^(1) and h^(2) (see Figure 1), the greedy strategy maximizes a lower bound on the likelihood of the data (instead of the likelihood itself), called the variational bound (Hinton et al., 2006):

log p(v) ≥ Σ_{h^(1)} Q(h^(1)|v) [ log p(h^(1)) + log P(v|h^(1)) ] − Σ_{h^(1)} Q(h^(1)|v) log Q(h^(1)|v)    (3)

where Q(h^(1)|v) is the posterior on hidden units h^(1) given visible vector v, according to the first RBM model, and is determined by W^(1); it is the assumed distribution used in the variational bound on the DBN likelihood. p(h^(1)) is the marginal distribution over h^(1) in the DBN (thus induced by the second RBM, between h^(1) and h^(2)). P(v|h^(1)) is the posterior over v given h^(1), in the DBN and in the first RBM, and is determined by W^(1).

Once the weights of the first layer (W^(1)) are frozen, the only element that can be optimized is p(h^(1)). We can show that there is an analytic formulation for the distribution p*(h^(1)) that maximizes this variational bound:

p*(h^(1)) = Σ_v p_0(v) Q(h^(1)|v)    (4)

where p_0 is the empirical distribution of input examples.
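
For a small first-level RBM, both the optimal p*(h^(1)) of eq. (4) and the one-up-down-pass distribution p_1 used in Proposition 3.1 below can be computed exactly by enumeration. The sketch below (ours; a random toy p_0 over 3-bit vectors stands in for real data) also checks that the resulting DBN marginal equals p_1:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    d, n = 3, 2                                   # visible / hidden sizes of the first RBM
    W1 = rng.standard_normal((n, d))
    b1, c1 = rng.standard_normal(d), rng.standard_normal(n)

    vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)   # all v
    hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)   # all h^(1)

    # A toy empirical distribution p_0 over the 2^d visible vectors.
    p0 = rng.random(len(vs)); p0 /= p0.sum()

    def Q(h, v):
        """Factorial posterior Q(h^(1) | v) of the first RBM."""
        q = sigm(c1 + W1 @ v)
        return np.prod(np.where(h == 1, q, 1 - q))

    def P_v_given_h(v, h):
        """Factorial conditional P(v | h^(1)) of the first RBM."""
        q = sigm(b1 + W1.T @ h)
        return np.prod(np.where(v == 1, q, 1 - q))

    # Eq. (4): p*(h^(1)) = sum_v p_0(v) Q(h^(1) | v).
    p_star = np.array([sum(p0[i] * Q(h, vs[i]) for i in range(len(vs))) for h in hs])

    # A 2-layer DBN whose top RBM achieves p* has marginal p(v) = sum_h P(v | h) p*(h),
    # which equals p_1, the distribution after one up-down pass from p_0.
    p_dbn = np.array([sum(P_v_given_h(v, hs[k]) * p_star[k] for k in range(len(hs)))
                      for v in vs])
    p_1 = np.array([sum(p0[i] * Q(hs[k], vs[i]) * P_v_given_h(v, hs[k])
                        for i in range(len(vs)) for k in range(len(hs))) for v in vs])

    print("p* sums to", p_star.sum())
    print("max |p_dbn - p_1| =", np.max(np.abs(p_dbn - p_1)))    # ~0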

One can sample from p*(h^(1)) by first randomly sampling a v from the empirical distribution and then propagating it stochastically through Q(h^(1)|v). Using Theorem 2.4, there exists an RBM that can approximate this optimal distribution p*(h^(1)) arbitrarily well.

Using an RBM that achieves this optimal p*(h^(1)) (optimal in terms of the variational bound, but not necessarily with respect to the likelihood), we can determine the distribution represented by the DBN. Let p_1 be the distribution one obtains when starting from p_0 clamped on the visible units of the lower layer (v), sampling the hidden units h^(1) given v and then sampling a v given h^(1).

Proposition 3.1. In a 2-layer DBN, using a second-layer RBM achieving p*(h^(1)), the model distribution p is equal to p_1. This is equivalent to making one up-down pass in the first RBM trained.

Proof. We can write the marginal p*(h^(1)) by summing over values v^0: p*(h^(1)) = Σ_{v^0} p_0(v^0) Q(h^(1)|v^0). Thus, the probability of the data under the 2-layer DBN when the top-layer RBM achieves p*(h^(1)) is

p(v^(0)) = Σ_{h^(1)} P(v^(0)|h^(1)) p*(h^(1))    (5)
         = Σ_{v^0} p_0(v^0) Σ_{h^(1)} Q(h^(1)|v^0) P(v^(0)|h^(1))
p(v^(0)) = p_1(v^(0))    (6)

The last line can be seen to be true by considering the stochastic process of first picking a v^0 from the empirical distribution p_0, then sampling an h^(1) from Q(h^(1)|v^0), and finally computing the probability of v^(0) under P(v^(0)|h^(1)) for that h^(1).

Proposition 3.1 tells us that, even with the best possible model for p(h^(1), h^(2)) according to the variational bound (i.e., the model that can achieve p*(h^(1))), we obtain a KL divergence between the DBN and the data equal to KL(p_0 ‖ p_1). Hence, if we train the 2nd level RBM to model the stochastic output of the 1st level RBM (as suggested in Hinton et al. (2006)), the best KL(p_0 ‖ p) we can achieve with model p of the 2-level DBN cannot be better than KL(p_0 ‖ p_1). Note that this result does not preclude that a better likelihood could be achieved with p if a better criterion is used to train the 2nd level RBM.

For KL(p_0 ‖ p_1) to be 0, one should have p_0 = p_1. Note that a weight vector with this property would not only be a fixed point of KL(p_0 ‖ p_1) but also of the likelihood and of contrastive divergence for the first-level RBM. p_0 = p_1 could have been obtained with a one-level DBN (i.e., a single RBM) that perfectly fits the data. This can happen when the first RBM has infinite weights, i.e., is deterministic, and just encodes v^(0) = v in h^(1) perfectly.

In that case the second layer h^(2) seems useless. Does that mean that adding layers is useless? We believe the answer is no; first, even though having the distribution that maximizes the variational bound yields p = p_1, this does not mean that we cannot achieve KL(p_0 ‖ p) < KL(p_0 ‖ p_1) with a 2-layer DBN (though we have no proof that it can be achieved either). Indeed, since the variational bound is not the quantity we truly want to optimize, another criterion might lead to a better model (in terms of the likelihood of the data). Besides that, even if adding layers does not allow us to perfectly fit the data (which might actually only be the case when we optimize the variational bound rather than the likelihood), the distribution of the 2-layer DBN is closer to the empirical distribution than is the first-layer RBM (we do only one up-down Gibbs step instead of doing an infinite number of such steps). Furthermore, the extra layers allow us to regularize and hopefully obtain a representation in which even a very high capacity top layer (e.g., a memory-based non-parametric density estimator) could generalize well.

This approach suggests using alternative criteria to train DBNs, criteria that approximate KL(p_0 ‖ p_1) and can be computed before h^(2) is added, but, unlike contrastive divergence, take into account the fact that more layers will be added later. Note that computing KL(p_0 ‖ p_1) exactly is intractable in an RBM because it involves summing over all possible values of the hidden vector h. One could use a sampling or mean-field approximation (replacing the summation over values of the hidden unit vector by a sample or a mean-field value), but even then there would remain a double sum over examples:

− (1/N) Σ_{i=1}^{N} log [ (1/N) Σ_{j=1}^{N} P̂(V_1 = v_i | V_0 = v_j) ]

where v_i denotes the i-th example and P̂(V_1 = v_i | V_0 = v_j) denotes an estimator of the probability of observing V_1 = v_i at iteration 1 of the Gibbs chain (that is, after one up-down pass) given that the chain is started from V_0 = v_j. We write P̂ rather than P because computing P exactly might involve an intractable summation over all possible values of h. In Bengio et al. (2007), the reconstruction error for training an autoencoder corresponding to one layer of a deep network is −log P̂(V_1 = v_i | V_0 = v_i). Hence −log P̂(V_1 = v_i | V_0 = v_j) is like a reconstruction error when one tries to reconstruct or predict v_i according to P(v|h) when starting from v_j, sampling h from P(h|v_j). This criterion is essentially the same as one already introduced in a different context in (Bengio, Larochelle, & Vincent, 2006b), where P̂(V_1 = v_i | V_0 = v_j) is computed deterministically (no hidden random variable is involved), and the inner sum (over the v_j's) is approximated by using only the 5 nearest neighbours of v_i in the training set. However, the overall computation time in (Bengio et al., 2006b) is O(N^2) because, like most non-parametric learning algorithms, it involves comparing all training examples with each other. In contrast, the contrastive divergence gradient estimator can be computed in O(N) for a training set of size N.
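
The double sum above translates directly into an O(N^2) estimator of the data-dependent part of KL(p_0 ‖ p_1) (the cross-entropy −E_{p_0}[log p_1]). A sketch (ours; here P̂ uses a single sampled hidden vector per starting example v_j, i.e. the sampling approximation mentioned in the text, and the model sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    def neg_loglik_p1(data, W, b, c):
        """Sampling estimator of -(1/N) sum_i log[(1/N) sum_j P_hat(V1 = v_i | V0 = v_j)],
        the data-dependent part of KL(p0 || p1).  O(N^2) in the number of examples."""
        N, d = data.shape
        n = W.shape[0]
        # One sampled hidden vector per starting example v_j.
        H = (rng.random((N, n)) < sigm(c + data @ W.T)).astype(float)
        PV = sigm(b + H @ W)                     # PV[j] = P(v = 1 | h_j), shape (N, d)
        total = 0.0
        for i in range(N):
            vi = data[i]
            # P_hat(V1 = v_i | V0 = v_j) for every j, under the factorial P(v | h_j).
            probs = np.prod(np.where(vi == 1, PV, 1 - PV), axis=1)
            total += np.log(probs.mean() + 1e-12)
        return -total / N

    # Usage on a toy binary dataset and a small random RBM.
    d, n, N = 10, 5, 40
    W = 0.1 * rng.standard_normal((n, d))
    b, c = np.zeros(d), np.zeros(n)
    data = rng.integers(0, 2, size=(N, d)).astype(float)
    print("estimated -E_{p0}[log p1] =", neg_loglik_p1(data, W, b, c))

Minimizing a quantity of this kind with respect to the first RBM's parameters is the sort of criterion evaluated in the experiment below.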

To evaluate whether tractable approximations of KL(p_0 ‖ p_1) would be worth investigating, we performed an experiment on a toy dataset and toy model where the computations are feasible. The data are 10-element bit vectors with patterns of 1, 2 or 3 consecutive ones (or zeros) against a background of zeros (or ones), demonstrating simple shift invariance. There are 60 possible examples (p_0), 40 of which are randomly chosen to train first an RBM with 5 binomial hidden units, and then a 2-layer DBN. The remaining 20 are a test set. The second RBM has 10 hidden units (so that we could guarantee improvement of the likelihood by the addition of the second layer). The first RBM is either trained by contrastive divergence or to minimize KL(p_0 ‖ p_1), using gradient descent and a learning rate of 0.1 for 500 epochs (parameters are updated after each epoch). Other learning rates and random initialization seeds gave similar results, diverged, or converged slower. The second RBM is then trained for the same number of epochs, by contrastive divergence with the same learning rate.

Figure 2: KL divergence, as a function of the number of epochs after adding the 2nd level RBM, between the empirical distribution p_0 (either training or test set) and (top curves) a DBN trained greedily with contrastive divergence at each layer, or (bottom curves) a DBN trained greedily with KL(p_0 ‖ p_1) on the 1st layer and contrastive divergence on the 2nd stage. [Curves: CD-DBN(train), KLp0p1-DBN(train), CD-DBN(test), KLp0p1-DBN(test); vertical axis: KL divergence to p_0.]

Figure 2 shows the exact KL(p_0 ‖ p) of the DBN p while training the 2nd RBM. The advantage of the KL(p_0 ‖ p_1) training is clear. This suggests that future research should investigate tractable approximations of KL(p_0 ‖ p_1).
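
For concreteness, the 60-pattern toy dataset described above can be generated as follows (a sketch under our reading that the runs of identical bits may wrap around the 10-bit vector, which is what yields exactly 60 distinct patterns; the paper does not spell this detail out):

    import numpy as np

    def toy_patterns(d=10, max_run=3):
        """d-bit vectors with a run of 1..max_run consecutive ones (or zeros),
        wrapping around, on a background of zeros (or ones)."""
        patterns = set()
        for run in range(1, max_run + 1):
            for start in range(d):
                v = np.zeros(d, dtype=int)
                v[[(start + k) % d for k in range(run)]] = 1
                patterns.add(tuple(v))            # run of ones on zeros
                patterns.add(tuple(1 - v))        # run of zeros on ones
        return np.array(sorted(patterns))

    data = toy_patterns()
    print(data.shape)        # (60, 10)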

3.3 Open Questions on DBN Representational Power

The results described in the previous section were motivated by the following question: since an RBM can represent any distribution, what can be gained by adding layers to a DBN, in terms of representational power? More formally, let R_l^n be a Deep Belief Network with l + 1 hidden layers, each of them composed of n units. Can we say something about the representational power of R_l^n as l increases? Denoting D_l^n the set of distributions one can obtain with R_l^n, it follows from the unfolding argument in Hinton et al. (2006) that D_l^n ⊆ D_{l+1}^n. The unfolding argument shows that the last layer of an l-layer DBN corresponds to an infinite directed graphical model with tied weights. By untying the weights in the (l + 1)-th RBM of this construction from those above, we obtain an (l + 1)-layer DBN. Hence every element of D_l^n can be represented in D_{l+1}^n. Two questions remain: do we have D_l^n ⊊ D_{l+1}^n, at least for l = 1? What is D_∞^n?

4 Conclusions

We have shown that when the number of hidden units is allowed to vary, Restricted Boltzmann Machines are very powerful and can approximate any distribution, eventually representing it exactly when the number of hidden units is allowed to become very large (possibly 2 to the power of the number of inputs). This only says that parameter values exist for doing so, but it does not prescribe how to obtain them efficiently. In addition, the above result is only concerned with the case of discrete inputs. It remains to be shown how to extend that type of result to the case of continuous inputs.

Restricted Boltzmann Machines are interesting chiefly because they are the building blocks of Deep Belief Networks, which can have many layers and can theoretically be much more efficient at representing complicated distributions (Bengio & Le Cun, 2007). We have introduced open questions about the expressive power of Deep Belief Networks. We have not answered these questions, but in trying to do so, we obtained an apparently puzzling result concerning Deep Belief Networks: the best that can be achieved by adding a second layer (with respect to some bound) is limited by the first layer's ability to map the data distribution to something close to itself (KL(p_0 ‖ p_1)), and this ability is good when the first layer is large and models the data well. So why do we need the extra layers? We believe that the answer lies in the ability of a Deep Belief Network to generalize better by having a more compact representation. This analysis also suggests investigating KL(p_0 ‖ p_1) (or an efficient approximation of it) as a less greedy alternative to contrastive divergence for training each layer, because it would take into account that more layers will be added.

Acknowledgements

The authors would like to thank the following funding organizations for support: NSERC, MITACS, and the Canada Research Chairs. They are also grateful for the help and comments from Olivier Delalleau and Aaron Courville.

References

Ackley, D., Hinton, G., & Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9.

Ajtai, M. (1983). Σ¹₁-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1-48.

Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, Lecture Notes in Computer Science.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Schölkopf, B., Platt, J., & Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press.

Bengio, Y., Delalleau, O., & Le Roux, N. (2006a). The curse of highly variable functions for local kernel machines. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA.

Bengio, Y., Larochelle, H., & Vincent, P. (2006b). Non-local manifold Parzen windows. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18. MIT Press.

Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., & Weston, J. (Eds.), Large Scale Kernel Machines. MIT Press.

Carreira-Perpiñan, M., & Hinton, G. (2005). On contrastive divergence learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Jan 6-8, 2005, Savannah Hotel, Barbados.

Hastad, J. T. (1987). Computational Limitations for Small Depth Circuits. MIT Press.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14.

Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786).

Hinton, G. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), Vol. 1.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In Schölkopf, B., Platt, J., & Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press.

Salakhutdinov, R., & Hinton, G. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. To appear in Proceedings of AISTATS 2007.

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. In Saul, L., Weiss, Y., & Bottou, L. (Eds.), Advances in Neural Information Processing Systems 17. MIT Press.

5 Appendix

5.1 Proof of Lemma 2.1

Proof. Denoting h̃ = [h, h_{n+1}], W̃ = [W; w^T] and C̃ = [C; c], where w^T denotes the transpose of w, and introducing z(v, h) = exp(h^T W v + B^T v + C^T h), we can express p(v, h) and p_{w,c}(v, h̃) as follows:

p(v, h) ∝ z(v, h)
p_{w,c}(v, h̃) ∝ exp(h̃^T W̃ v + B^T v + C̃^T h̃) = z(v, h) exp(h_{n+1} w^T v + c h_{n+1})

Expanding the expression of p_{w,c}(v) and regrouping the terms similar to the expression of p(v), we get:

p_{w,c}(v) = Σ_{h̃} exp(h^T W v + h_{n+1} w^T v + B^T v + C^T h + c h_{n+1}) / Σ_{v^0, h̃^0} exp((h^0)^T W v^0 + h^0_{n+1} w^T v^0 + B^T v^0 + C^T h^0 + c h^0_{n+1})
           = Σ_h z(v, h) (1 + exp(w^T v + c)) / Σ_{v^0, h^0} z(v^0, h^0) (1 + exp(w^T v^0 + c))
           = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0} (1 + exp(w^T v^0 + c)) Σ_{h^0} z(v^0, h^0)

But Σ_h z(v, h) = p(v) K with K = Σ_{v,h} z(v, h). Thus,

p_{w,c}(v) = (1 + exp(w^T v + c)) p(v) / Σ_{v^0} (1 + exp(w^T v^0 + c)) p(v^0)

which does not depend on our particular choice of R_p (since it only depends on p).

5.2 Proof of Theorem 2.3

Proof. Expanding the expression of p_{w,c}(v) and regrouping the terms similar to the expression of p(v), we get:

p_{w,c}(v) = Σ_{h̃} exp(h^T W v + h_{n+1} w^T v + B^T v + C^T h + c h_{n+1}) / Σ_{v^0, h̃^0} exp((h^0)^T W v^0 + h^0_{n+1} w^T v^0 + B^T v^0 + C^T h^0 + c h^0_{n+1})
           = Σ_h z(v, h) (1 + exp(w^T v + c)) / Σ_{v^0, h^0} z(v^0, h^0) (1 + exp(w^T v^0 + c))
           = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0)

Therefore, we have:

KL(p_0 ‖ p_{w,c}) = Σ_v p_0(v) log p_0(v) − Σ_v p_0(v) log p_{w,c}(v)
 = −H(p_0) − Σ_v p_0(v) log [ (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]
 = −H(p_0) − Σ_v p_0(v) log(1 + exp(w^T v + c)) − Σ_v p_0(v) log Σ_h z(v, h) + Σ_v p_0(v) log [ Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]

Assuming w^T v + c is a very large negative value for all v, we can use the logarithmic series identity around 0 (log(1 + x) = x + o_{x→0}(x)) for the second and the last term. The second term becomes

− Σ_v p_0(v) log(1 + exp(w^T v + c)) = − Σ_v p_0(v) exp(w^T v + c) + o_c(exp(c))

and the last term becomes

Σ_v p_0(v) log [ Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]
 = log [ Σ_{v^0, h^0} z(v^0, h^0) ] + log [ 1 + Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) ]
 = log Σ_{v^0, h^0} z(v^0, h^0) + Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) + o_c(exp(c))

But

Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) = Σ_v exp(w^T v + c) Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0) = Σ_v exp(w^T v + c) p(v)

Footnote 1: o notation: f(x) = o_{x→+∞}(g(x)) if lim_{x→+∞} f(x)/g(x) exists and equals 0.

Putting all terms back together, we have

KL(p_0 ‖ p_{w,c}) = −H(p_0) − Σ_v p_0(v) exp(w^T v + c) + Σ_v p(v) exp(w^T v + c) + o_c(exp(c)) − Σ_v p_0(v) log Σ_h z(v, h) + log Σ_{v^0, h^0} z(v^0, h^0)
                  = KL(p_0 ‖ p) + Σ_v exp(w^T v + c) (p(v) − p_0(v)) + o_c(exp(c))

Finally, we have

KL(p_0 ‖ p_{w,c}) − KL(p_0 ‖ p) = exp(c) Σ_v exp(w^T v) (p(v) − p_0(v)) + o_c(exp(c))    (7)

The question now becomes: can we find a w such that Σ_v exp(w^T v) (p(v) − p_0(v)) is negative? As p_0 ≠ p, there is a v̂ such that p(v̂) < p_0(v̂). Then there exists a positive scalar a such that ŵ = a (v̂ − ½ e) (with e = [1 ... 1]^T) yields Σ_v exp(ŵ^T v) (p(v) − p_0(v)) < 0. Indeed, for v ≠ v̂, we have

exp(ŵ^T v) / exp(ŵ^T v̂) = exp( ŵ^T (v − v̂) ) = exp( a (v̂ − ½ e)^T (v − v̂) ) = exp( a Σ_i (v̂_i − ½)(v_i − v̂_i) )

For i such that v_i − v̂_i > 0, we have v_i = 1 and v̂_i = 0. Thus, v̂_i − ½ = −½ and the term inside the exponential is negative (since a is positive). For i such that v_i − v̂_i < 0, we have v_i = 0 and v̂_i = 1. Thus, v̂_i − ½ = ½ and the term inside the exponential is also negative. Furthermore, the terms come close to 0 as a goes to infinity. Since the sum can be decomposed as

Σ_v exp(ŵ^T v) (p(v) − p_0(v)) = exp(ŵ^T v̂) Σ_v [exp(ŵ^T v) / exp(ŵ^T v̂)] (p(v) − p_0(v))
 = exp(ŵ^T v̂) [ p(v̂) − p_0(v̂) + Σ_{v ≠ v̂} [exp(ŵ^T v) / exp(ŵ^T v̂)] (p(v) − p_0(v)) ]

we have

Σ_v exp(ŵ^T v) (p(v) − p_0(v))  ∼_{a→+∞}  exp(ŵ^T v̂) (p(v̂) − p_0(v̂)) < 0.

Footnote 2: ∼ notation: f(x) ∼_{x→+∞} g(x) if lim_{x→+∞} f(x)/g(x) exists and equals 1.

Therefore, there is a value â such that, if a > â, Σ_v exp(ŵ^T v) (p(v) − p_0(v)) < 0. This concludes the proof.

5.3 Proof of Theorem 2.4

Proof. In the former proof, we had

p_{w,c}(v) = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0)

Let ṽ be an arbitrary input vector and let ŵ be defined in the same way as before, i.e. ŵ = a (ṽ − ½ e). Now define ĉ = −ŵ^T ṽ + λ with λ ∈ R. We have:

lim_{a→+∞} 1 + exp(ŵ^T v + ĉ) = 1   for v ≠ ṽ
1 + exp(ŵ^T ṽ + ĉ) = 1 + exp(λ)

Thus, we can see that, for v ≠ ṽ:

lim_{a→+∞} p_{ŵ,ĉ}(v) = Σ_h z(v, h) / [ Σ_{v^0 ≠ ṽ, h^0} z(v^0, h^0) + (1 + exp(ŵ^T ṽ + ĉ)) Σ_{h^0} z(ṽ, h^0) ]
 = Σ_h z(v, h) / [ Σ_{v^0, h^0} z(v^0, h^0) + exp(λ) Σ_{h^0} z(ṽ, h^0) ]
 = [ Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0) ] · 1 / [ 1 + exp(λ) Σ_{h^0} z(ṽ, h^0) / Σ_{v^0, h^0} z(v^0, h^0) ]

Remembering p(v) = Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0), we have for v ≠ ṽ:

lim_{a→+∞} p_{ŵ,ĉ}(v) = p(v) / (1 + exp(λ) p(ṽ))    (8)

Similarly, we can see that

lim_{a→+∞} p_{ŵ,ĉ}(ṽ) = (1 + exp(λ)) p(ṽ) / (1 + exp(λ) p(ṽ))    (9)

Depending on the value of λ, one can see that adding a hidden unit allows one to increase the probability of an arbitrary ṽ and to uniformly decrease the probability of every other v by a multiplicative factor. However, one can also see that, if p(ṽ) = 0, then p_{ŵ,ĉ}(ṽ) = 0 for all λ.

We can therefore build the desired RBM as follows. Let us index the v's over the integers from 1 to 2^n and sort them such that

p(v_{k+1}) = ... = p(v_{2^n}) = 0 < p(v_1) ≤ p(v_2) ≤ ... ≤ p(v_k)

Let us denote p^i the distribution of an RBM with i hidden units. We start with an RBM whose weights and biases are all equal to 0. The marginal distribution over the visible units induced by that RBM is the uniform distribution. Thus,

p^0(v_1) = ... = p^0(v_{2^n}) = 2^{−n}

We define w_1 = a_1 (v_1 − ½ e) and c_1 = −w_1^T v_1 + λ_1. As shown before, we now have:

lim_{a_1→+∞} p^1(v_1) = (1 + exp(λ_1)) 2^{−n} / (1 + exp(λ_1) 2^{−n})
lim_{a_1→+∞} p^1(v_i) = 2^{−n} / (1 + exp(λ_1) 2^{−n}),   i ≥ 2

As we can see, we can set p^1(v_1) to a value arbitrarily close to 1, with a uniform distribution over v_2, ..., v_{2^n}. Then, we can choose λ_2 such that p^2(v_2)/p^2(v_1) = p(v_2)/p(v_1). This is possible since we can arbitrarily increase p^2(v_2) while multiplying the other probabilities by a constant factor, and since p(v_2)/p(v_1) ≥ p^1(v_2)/p^1(v_1). We can continue the procedure until obtaining p^k(v_k). The ratio p^i(v_j)/p^i(v_{j−1}) does not depend on the value of i as long as i > j (because at each such step i, the two probabilities are multiplied by the same factor). We will then have

p^k(v_k)/p^k(v_{k−1}) = p(v_k)/p(v_{k−1}), ..., p^k(v_2)/p^k(v_1) = p(v_2)/p(v_1)
p^k(v_{k+1}) = ... = p^k(v_{2^n})

From that, we can deduce that p^k(v_1) = ν_k p(v_1), ..., p^k(v_k) = ν_k p(v_k) with ν_k = 1 − (2^n − k) p^k(v_{2^n}). We also have p^k(v_1)/p^k(v_{2^n}) = p^1(v_1)/p^1(v_{2^n}) = 1 + exp(λ_1). Thus, p^k(v_1) = p(v_1) [1 − (2^n − k) p^k(v_{2^n})] = (1 + exp(λ_1)) p^k(v_{2^n}). Solving the above equations, we have

p^k(v_i) = p(v_1) / [ 1 + exp(λ_1) + p(v_1) (2^n − k) ]   for i > k    (10)
p^k(v_i) = p(v_i) (1 + exp(λ_1)) / [ 1 + exp(λ_1) + p(v_1) (2^n − k) ]   for i ≤ k    (11)

Using the logarithmic series identity around 0 (log(1 + x) = x + o_{x→0}(x)) for KL(p ‖ p^k) when λ_1 goes to infinity, we have

KL(p ‖ p^k) = Σ_i p(v_i) (2^n − k) p(v_1) / (1 + exp(λ_1)) + o_{λ_1→+∞}(exp(−λ_1))    (12)

This concludes the proof.
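
As a numerical sanity check of this construction (ours, not part of the paper), one can add a single hidden unit with ŵ = a (ṽ − ½ e) and ĉ = −ŵ^T ṽ + λ to a tiny RBM and compare the exact marginals against the limits (8) and (9); with a moderately large a the agreement is already very close:

    import itertools
    import numpy as np

    rng = np.random.default_rng(2)

    def exact_marginal(W, b, c):
        """Exact p(v) of a binary RBM by enumeration (tiny models only)."""
        n, d = W.shape
        vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
        hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
        z = np.exp(hs @ W @ vs.T + (vs @ b)[None, :] + (hs @ c)[:, None])
        p = z.sum(axis=0)
        return vs, p / p.sum()

    d, n = 3, 2
    W = rng.standard_normal((n, d))
    b, c = rng.standard_normal(d), rng.standard_normal(n)
    vs, p = exact_marginal(W, b, c)

    # Construction of the proof: pick a target vector v_tilde, set
    # w_hat = a (v_tilde - 1/2) and c_hat = -w_hat . v_tilde + lam, with a large.
    v_tilde = vs[5]
    a, lam = 60.0, 1.0
    w_hat = a * (v_tilde - 0.5)
    c_hat = -w_hat @ v_tilde + lam

    _, p_new = exact_marginal(np.vstack([W, w_hat]), b, np.append(c, c_hat))

    i = 5
    denom = 1.0 + np.exp(lam) * p[i]
    print("predicted p_new(v_tilde):", (1.0 + np.exp(lam)) * p[i] / denom,
          "actual:", p_new[i])
    mask = np.arange(len(vs)) != i
    print("predicted others:", (p[mask] / denom)[:2], "actual:", p_new[mask][:2])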


How to Find the Derivative of a Function: Calculus 1 Introduction How to Find te Derivative of a Function: Calculus 1 Calculus is not an easy matematics course Te fact tat you ave enrolled in suc a difficult subject indicates tat you are interested in te

More information

Some Review Problems for First Midterm Mathematics 1300, Calculus 1

Some Review Problems for First Midterm Mathematics 1300, Calculus 1 Some Review Problems for First Midterm Matematics 00, Calculus. Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd,

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

Analytic Functions. Differentiable Functions of a Complex Variable

Analytic Functions. Differentiable Functions of a Complex Variable Analytic Functions Differentiable Functions of a Complex Variable In tis capter, we sall generalize te ideas for polynomials power series of a complex variable we developed in te previous capter to general

More information

Lecture 10: Carnot theorem

Lecture 10: Carnot theorem ecture 0: Carnot teorem Feb 7, 005 Equivalence of Kelvin and Clausius formulations ast time we learned tat te Second aw can be formulated in two ways. e Kelvin formulation: No process is possible wose

More information

Edge Detection Based on the Newton Interpolation s Fractional Differentiation

Edge Detection Based on the Newton Interpolation s Fractional Differentiation Te International Arab Journal of Information Tecnology, Vol. 11, No. 3, May 014 3 Edge Detection Based on te Newton Interpolation s Fractional Differentiation Caobang Gao 1,, Jiliu Zou, 3, and Weiua Zang

More information

MTH-112 Quiz 1 Name: # :

MTH-112 Quiz 1 Name: # : MTH- Quiz Name: # : Please write our name in te provided space. Simplif our answers. Sow our work.. Determine weter te given relation is a function. Give te domain and range of te relation.. Does te equation

More information

Math 212-Lecture 9. For a single-variable function z = f(x), the derivative is f (x) = lim h 0

Math 212-Lecture 9. For a single-variable function z = f(x), the derivative is f (x) = lim h 0 3.4: Partial Derivatives Definition Mat 22-Lecture 9 For a single-variable function z = f(x), te derivative is f (x) = lim 0 f(x+) f(x). For a function z = f(x, y) of two variables, to define te derivatives,

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

2.3 Algebraic approach to limits

2.3 Algebraic approach to limits CHAPTER 2. LIMITS 32 2.3 Algebraic approac to its Now we start to learn ow to find its algebraically. Tis starts wit te simplest possible its, and ten builds tese up to more complicated examples. Fact.

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

arxiv: v3 [cs.lg] 14 Oct 2015

arxiv: v3 [cs.lg] 14 Oct 2015 Learning Deep Generative Models wit Doubly Stocastic MCMC Cao Du Jun Zu Bo Zang Dept. of Comp. Sci. & Tec., State Key Lab of Intell. Tec. & Sys., TNList Lab, Center for Bio-Inspired Computing Researc,

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line Teacing Differentiation: A Rare Case for te Problem of te Slope of te Tangent Line arxiv:1805.00343v1 [mat.ho] 29 Apr 2018 Roman Kvasov Department of Matematics University of Puerto Rico at Aguadilla Aguadilla,

More information

Math 161 (33) - Final exam

Math 161 (33) - Final exam Name: Id #: Mat 161 (33) - Final exam Fall Quarter 2015 Wednesday December 9, 2015-10:30am to 12:30am Instructions: Prob. Points Score possible 1 25 2 25 3 25 4 25 TOTAL 75 (BEST 3) Read eac problem carefully.

More information

7 Semiparametric Methods and Partially Linear Regression

7 Semiparametric Methods and Partially Linear Regression 7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).

More information

Mixing Rates for the Alternating Gibbs Sampler over Restricted Boltzmann Machines and Friends

Mixing Rates for the Alternating Gibbs Sampler over Restricted Boltzmann Machines and Friends Mixing Rates for te Alternating Gibbs Sampler oer Restricted Boltzmann Macines and Friends Cristoper Tos CTOSH@CS.UCSD.EDU Department of Computer Science and Engineering, UC San Diego, 9500 Gilman Dr.,

More information

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a?

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a? Solutions to Test 1 Fall 016 1pt 1. Te grap of a function f(x) is sown at rigt below. Part I. State te value of eac limit. If a limit is infinite, state weter it is or. If a limit does not exist (but is

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES Ronald Ainswort Hart Scientific, American Fork UT, USA ABSTRACT Reports of calibration typically provide total combined uncertainties

More information

Fractional Derivatives as Binomial Limits

Fractional Derivatives as Binomial Limits Fractional Derivatives as Binomial Limits Researc Question: Can te limit form of te iger-order derivative be extended to fractional orders? (atematics) Word Count: 669 words Contents - IRODUCIO... Error!

More information

Section 3: The Derivative Definition of the Derivative

Section 3: The Derivative Definition of the Derivative Capter 2 Te Derivative Business Calculus 85 Section 3: Te Derivative Definition of te Derivative Returning to te tangent slope problem from te first section, let's look at te problem of finding te slope

More information