Approximation properties of DBNs with binary hidden units and real-valued visible units


Oswin Krause, Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
Asja Fischer, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany, and Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
Tobias Glasmachers, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
Christian Igel, Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark

Abstract

Deep belief networks (DBNs) can approximate any distribution over fixed-length binary vectors. However, DBNs are frequently applied to model real-valued data, and so far little is known about their representational power in this case. We analyze the approximation properties of DBNs with two layers of binary hidden units and visible units with conditional distributions from the exponential family. It is shown that these DBNs can, under mild assumptions, model any additive mixture of distributions from the exponential family with independent variables. An arbitrarily good approximation in terms of Kullback-Leibler divergence of an m-dimensional mixture distribution with n components can be achieved by a DBN with m visible variables and n and n+1 hidden variables in the first and second hidden layer, respectively. Furthermore, relevant infinite mixtures can be approximated arbitrarily well by a DBN with a finite number of neurons. This includes the important special case of an infinite mixture of Gaussian distributions with fixed variance restricted to a compact domain, which in turn can approximate any strictly positive density over this domain.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the authors.

1. Introduction

Restricted Boltzmann machines (RBMs; Smolensky, 1986; Hinton, 2002) and deep belief networks (DBNs; Hinton et al., 2006; Hinton & Salakhutdinov, 2006) are probabilistic models with latent and observable variables, which can be interpreted as stochastic neural networks. Binary RBMs, in which each variable conditioned on the others is Bernoulli distributed, can approximate any distribution over the observable variables arbitrarily well (Le Roux & Bengio, 2008; Montufar & Ay, 2011). Binary deep belief networks are built by layering binary RBMs, and their representational power does not decrease when layers are added (Le Roux & Bengio, 2008; Montufar & Ay, 2011). In fact, it can be shown that a binary DBN never needs more variables than a binary RBM to model a distribution with a certain accuracy (Le Roux & Bengio, 2010). However, arguably the most prominent recent applications involving RBMs consider models in which the visible variables are real-valued (e.g., Salakhutdinov & Hinton, 2007; Lee et al., 2009; Taylor et al., 2010; Le Roux et al., 2011). Welling et al. (2005) proposed a notion of RBMs in which the conditional distributions of the observable variables given the latent variables, and vice versa, are chosen almost arbitrarily from the exponential family. This includes the important special case of the Gaussian-binary RBM (GB-RBM, also called Gaussian-Bernoulli RBM), an RBM with binary hidden and Gaussian visible variables. Despite their frequent use, little is known about the approximation capabilities of RBMs and DBNs modeling continuous distributions.

Clearly, orchestrating a set of Bernoulli distributions to model a distribution over binary vectors is easy compared to approximating distributions over R^m. Recently, Wang et al. (2012) have emphasized that the distribution of the visible variables represented by a GB-RBM with n hidden units is a mixture of 2^n Gaussian distributions with means lying on the vertices of a projected n-dimensional hyperparallelotope. This limited flexibility makes modeling even a mixture of a finite number of Gaussian distributions with a GB-RBM difficult.

This work is a first step towards understanding the representational power of DBNs with binary latent and real-valued visible variables. We show, for a subset of distributions relevant in practice, that DBNs with two layers of binary hidden units and a fixed family of conditional distributions for the visible units can model finite mixtures of that family arbitrarily well. As this also holds for infinite mixtures of Gaussians with fixed variance restricted to a compact domain, our results imply universal approximation of strictly positive densities over compact sets.

2. Background

This section recalls basic results on the approximation properties of mixture distributions and binary RBMs. Furthermore, the considered models are defined.

2.1. Mixture distributions

A mixture distribution p_mix(v) over Ω is a convex combination of simpler distributions which are members of some family G of distributions over Ω parameterized by θ ∈ Θ. We define

MIX(n, G) = { Σ_{i=1}^n p_mix(v|i) p_mix(i) | Σ_{i=1}^n p_mix(i) = 1 and ∀ i ∈ {1, ..., n}: p_mix(i) ≥ 0 ∧ p_mix(v|i) ∈ G }

as the family of mixtures of n distributions from G. Furthermore, we denote the family of infinite mixtures of distributions from G by

CONV(G) = { ∫_Θ p(v|θ) p(θ) dθ | ∫_Θ p(θ) dθ = 1 and ∀ θ ∈ Θ: p(θ) ≥ 0 ∧ p(v|θ) ∈ G }.

Li & Barron have shown that, for some family of distributions G, every element from CONV(G) can be approximated arbitrarily well by finite mixtures with respect to the Kullback-Leibler divergence (KL-divergence):

Theorem 1 (Li & Barron, 2000). Let f ∈ CONV(G). There exists a finite mixture p_mix ∈ MIX(n, G) such that

KL(f ‖ p_mix) ≤ c_f^2 γ / n,

where c_f^2 = ∫ ( ∫ f(v|θ)^2 f(θ) dθ ) / ( ∫ f(v|θ) f(θ) dθ ) dv and γ = 4 [ log(3√e) + a ] with a = sup_{θ_1, θ_2, v} log( f(v|θ_1) / f(v|θ_2) ).

The bound is not necessarily finite. However, it follows from previous results by Zeevi & Meir (1997) that for every f and every ε > 0 there exists a mixture p_mix with n components such that KL(f ‖ p_mix) ≤ ε + c/n for some constant c, if Ω ⊂ R^m is a compact set and f is continuous and bounded from below by some η > 0 (i.e., ∀ x ∈ Ω: f(x) ≥ η > 0). Furthermore, it follows that for compact Ω ⊂ R^m every continuous density f on Ω can be approximated arbitrarily well by an infinite but countable mixture of Gaussian distributions with fixed variance σ^2 and means restricted to Ω, that is, by a mixture of distributions from the family

G_σ = { p(x) = (2πσ^2)^{-m/2} exp( -‖x − µ‖^2 / (2σ^2) ) | µ ∈ Ω }   (1)

for sufficiently small σ.
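To make the objects in MIX(n, G_σ) concrete, the following short Python sketch evaluates the density of a finite mixture of isotropic Gaussians with fixed variance σ^2, i.e., exactly the convex combination defining MIX(n, G_σ) above. The sketch is our own illustration and not part of the original paper; all function names are ours.

```python
import numpy as np

def gaussian_density(x, mu, sigma2):
    """Isotropic Gaussian with mean mu and fixed variance sigma2, i.e. a member of G_sigma."""
    m = x.shape[-1]
    sq_dist = np.sum((x - mu) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (m / 2.0)

def mixture_density(x, means, weights, sigma2):
    """An element of MIX(n, G_sigma): a convex combination of n Gaussian components."""
    components = np.array([gaussian_density(x, mu, sigma2) for mu in means])
    return float(np.dot(weights, components))

# Toy example: a 3-component mixture over R^2 with means in the unit square.
rng = np.random.default_rng(0)
means = rng.uniform(0.0, 1.0, size=(3, 2))
weights = np.array([0.5, 0.3, 0.2])          # nonnegative and summing to one
print(mixture_density(np.array([0.4, 0.6]), means, weights, sigma2=0.05))
```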
2.2. Restricted Boltzmann Machines

An RBM is an undirected graphical model with a bipartite structure (Smolensky, 1986; Hinton, 2002) consisting of one layer of m visible variables V = (V_1, ..., V_m) ∈ Ω and one layer of n hidden variables H = (H_1, ..., H_n) ∈ Λ. The modeled joint distribution is a Gibbs distribution p(v, h) = (1/Z) e^{−E(v,h)} with energy E and normalization constant Z = ∫_Ω ∫_Λ e^{−E(v,h)} dh dv, where the variables of one layer are mutually independent given the state of the other layer.

2.2.1. Binary-Binary RBMs

In the standard binary RBM the state spaces of the variables are Ω = {0,1}^m and Λ = {0,1}^n. The energy is given by

E(v, h) = −v^T W h − v^T b − c^T h

with weight matrix W and bias vectors b and c. Le Roux & Bengio showed that binary RBMs are universal approximators for distributions over binary vectors:

Theorem 2 (Le Roux & Bengio, 2008). Any distribution over Ω = {0,1}^m can be approximated arbitrarily well with respect to the KL-divergence by an RBM with k+1 hidden units, where k is the number of input vectors whose probability is not zero.

The number of hidden neurons required can be reduced to the minimum number of pairs of input vectors differing in only one component with the property that their union contains all observable patterns having positive probability (Montufar & Ay, 2011).
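As a small, self-contained illustration of this setup (again our own sketch, not code from the paper; the names are ours), the following Python snippet computes the visible marginal p(v) of a tiny binary RBM by brute force, directly from the Gibbs distribution p(v,h) ∝ e^{−E(v,h)} with the energy above. This is only feasible for very small m and n, but it makes the objects appearing in Theorem 2 explicit.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """Binary RBM energy E(v, h) = -v^T W h - v^T b - c^T h."""
    return -(v @ W @ h + v @ b + c @ h)

def visible_marginal(W, b, c):
    """Exact p(v) of a small binary RBM, obtained by brute-force summation
    of the Gibbs distribution p(v, h) = exp(-E(v, h)) / Z over all states."""
    m, n = W.shape
    states_v = list(itertools.product([0, 1], repeat=m))
    states_h = list(itertools.product([0, 1], repeat=n))
    joint = np.array([[np.exp(-energy(np.array(v), np.array(h), W, b, c))
                       for h in states_h] for v in states_v])
    joint /= joint.sum()                     # divide by the partition function Z
    return {v: joint[i].sum() for i, v in enumerate(states_v)}

# Toy RBM with m = 2 visible and n = 3 hidden units.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(2, 3)), rng.normal(size=2), rng.normal(size=3)
print(visible_marginal(W, b, c))
```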

2.2.2. Exponential-Family RBMs

Welling et al. (2005) introduced a framework for constructing generalized RBMs called exponential family harmoniums. In this framework, the conditional distributions p(h_i|v) and p(v_j|h), i = 1, ..., n, j = 1, ..., m, belong to the exponential family. Almost all types of RBMs encountered in practice, including binary RBMs, can be interpreted as exponential family harmoniums. The exponential family is the class F of probability distributions that can be written in the form

p(x) = (1/Z) exp( Σ_{r=1}^k Φ_r(x)^T µ_r(θ) ),   (2)

where θ are the parameters of the distribution and Z is the normalization constant. (By setting k = 1 and rewriting Φ and µ accordingly, one obtains the standard formulation.) The functions Φ_r and µ_r, for r = 1, ..., k, transform the sample space and the distribution parameters, respectively. Let I be the subset of F in which the components of x = (x_1, ..., x_m) are independent of each other, that is, I = { p ∈ F | ∀ x: p(x_1, ..., x_m) = p(x_1) p(x_2) ··· p(x_m) }. For elements of I the function Φ_r can be written as Φ_r(x) = (φ^r_1(x_1), ..., φ^r_m(x_m)). A prominent subset of I is the family of Gaussian distributions with fixed variance σ^2, G_σ ⊂ I, see equation (1).

Following Welling et al., the energy of an RBM with binary hidden units and visible units with p(v|h) ∈ I is given by

E(v, h) = − Σ_{r=1}^k Φ_r(v)^T W^r h − Σ_{r=1}^k Φ_r(v)^T b^r − c^T h,   (3)

where Φ_r(v) = (φ^r_1(v_1), ..., φ^r_m(v_m)). Note that not every possible choice of parameters necessarily leads to a finite normalization constant and thus to a proper distribution. If the joint distribution is properly defined, the conditional probability of the visible units given the hidden units is

p(v|h) = (1/Z_h) exp( Σ_{r=1}^k Φ_r(v)^T ( W^r h + b^r ) ),   (4)

where Z_h is the corresponding normalization constant. Thus, the marginal distribution of the visible units p(v) can be expressed as a mixture of 2^n conditional distributions:

p(v) = Σ_{h ∈ {0,1}^n} p(v|h) p(h) ∈ MIX(2^n, I).

2.3. Deep Belief Networks

A DBN is a graphical model with more than two layers built by stacking RBMs (Hinton et al., 2006; Hinton & Salakhutdinov, 2006). A DBN with two layers of hidden variables H and Ĥ and a visible layer V is characterized by a probability distribution p(v, h, ĥ) that fulfills p(v, h, ĥ) = p(v|h) p(h, ĥ) = p(ĥ|h) p(v, h). In this study we are interested in the approximation properties of DBNs with two binary hidden layers and real-valued visible neurons. We refer to such a DBN as a B-DBN. With B-DBN(G) we denote the family of all B-DBNs having conditional distributions p(v|h) ∈ G for all h ∈ {0,1}^n.

3. Approximation properties

This section presents our results on the approximation properties of DBNs with binary hidden units and real-valued visible units. It consists of the following steps: Lemma 3 gives an upper bound on the KL-divergence between a B-DBN and a finite additive mixture model, however under the assumption that the B-DBN contains the mixture components. For mixture models from a subset of I, Lemma 4 and Theorem 5 show that such B-DBNs actually exist and that the KL-divergence can be made arbitrarily small. Corollary 6 specializes the previous theorem to the case of Gaussian mixtures, showing how the bound can be applied to distributions used in practice. Finally, Theorem 7 generalizes the results to infinite mixture distributions, and thus to the approximation of arbitrary strictly positive densities on a compact set Ω.
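Before turning to the results, the fact from Section 2.2.2 that the visible marginal of an exponential-family RBM is a finite mixture is easy to make concrete for the Gaussian-binary case. The sketch below (our own illustration, not from the paper) enumerates the 2^n component means b + W h of a GB-RBM, whose conditional p(v|h) is Gaussian with mean b + W h under the usual GB-RBM energy written out in the proof of Corollary 6 below; the mixture weights p(h) are determined by the model but are not computed here.

```python
import itertools
import numpy as np

def gb_rbm_component_means(W, b):
    """Means of the 2^n Gaussian components of the visible marginal of a
    Gaussian-binary RBM: p(v|h) is Gaussian with mean b + W h, so the means
    lie on {b + W h | h in {0,1}^n}, the vertices of a projected
    n-dimensional parallelotope (cf. Wang et al., 2012)."""
    n = W.shape[1]
    return np.array([b + W @ np.array(h, dtype=float)
                     for h in itertools.product([0, 1], repeat=n)])

# Toy example: m = 2 visible and n = 3 hidden units give 8 component means.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)
print(gb_rbm_component_means(W, b))
```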

3.1. Finite mixtures

We first introduce a construction that will enable us to model mixtures of distributions by DBNs. For some family G, an arbitrary mixture of distributions p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) over v can be expressed in terms of a joint probability distribution of v and h ∈ {0,1}^n by defining the distribution

q_mix(h) = p_mix(i) if h = e_i, and q_mix(h) = 0 else,   (5)

over {0,1}^n, where e_i is the i-th unit vector. Then we can rewrite p_mix(v) as p_mix(v) = Σ_h q_mix(v|h) q_mix(h), where q_mix(v|h) ∈ G for all h ∈ {0,1}^n and q_mix(v|e_i) = p_mix(v|i) for all i = 1, ..., n. This can be interpreted as expressing p_mix(v) as an element of MIX(2^n, G) with 2^n − n mixture components having a probability or weight equal to zero.

Now we can model p_mix(v) by the marginal distribution of the visible variables p(v) = Σ_{h,ĥ} p(v|h) p(h, ĥ) = Σ_h p(v|h) p(h) of a B-DBN p(v, h, ĥ) ∈ B-DBN(G) with the following properties:

1. p(v|e_i) = p_mix(v|i) for i = 1, ..., n, and
2. p(h) = Σ_ĥ p(h, ĥ) approximates q_mix(h).

Following this line of thought we can formulate our first result. It provides an upper bound on the KL-divergence between any element from MIX(n, G) and the marginal distribution of the visible variables of a B-DBN with the properties stated above, where p(h) models q_mix(h) with an approximation error smaller than a given ε.

Lemma 3. Let p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) be a mixture of n components from a family of distributions G, and let q_mix(h) be defined as in (5). Let p(v, h, ĥ) ∈ B-DBN(G) with the properties p(v|e_i) = p_mix(v|i) for i = 1, ..., n and ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε for some ε > 0. Then the KL-divergence between p_mix and p is bounded by

KL(p ‖ p_mix) ≤ B(G, p_mix, ε),

where B(G, p_mix, ε) = ε ∫ α(v) β(v) dv + 2^n (1+ε) log(1+ε) with α(v) = Σ_h p(v|h) and β(v) = log( 1 + α(v)/p_mix(v) ).

Proof. Using |p(h) − q_mix(h)| < ε for all h ∈ {0,1}^n and p_mix(v) = Σ_h p(v|h) q_mix(h), we can write

p(v) = Σ_h p(v|h) p(h) = Σ_h p(v|h) ( q_mix(h) + p(h) − q_mix(h) ) = p_mix(v) + Σ_h p(v|h) ( p(h) − q_mix(h) ) ≤ p_mix(v) + α(v) ε,

where α(v) is defined as above. Thus, we get for the KL-divergence

KL(p ‖ p_mix) = ∫ p(v) log( p(v)/p_mix(v) ) dv ≤ ∫ ( p_mix(v) + α(v) ε ) log( ( p_mix(v) + α(v) ε ) / p_mix(v) ) dv =: ∫ F(ε, v) dv = ∫ ∫_0^ε (∂/∂ε̃) F(ε̃, v) dε̃ dv,

using F(0, v) = 0. Because 1 + xε ≤ (1 + x)(1 + ε) for all x, ε ≥ 0, we can upper bound ∂F(ε, v)/∂ε by

∂F(ε, v)/∂ε = α(v) [ 1 + log( 1 + α(v) ε / p_mix(v) ) ] ≤ α(v) [ 1 + log( (1 + α(v)/p_mix(v)) (1 + ε) ) ] = α(v) [ 1 + β(v) + log(1 + ε) ]

with β(v) as defined above. By integration we get

F(ε, v) = ∫_0^ε ∂F(ε̃, v)/∂ε̃ dε̃ ≤ α(v) β(v) ε + α(v) (1 + ε) log(1 + ε).

Integration with respect to v completes the proof.

The proof does not use the independence properties of p(v|h). Thus, it is possible to apply this bound also to mixture distributions which do not have conditionally independent variables. However, in this case one has to show that a generalization of the B-DBN exists which can model the target distribution, as the formalism introduced in formula (3) does not cover distributions which are not in I.
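The construction in (5) is easy to check numerically. The sketch below (an illustration of ours, not code from the paper; scipy is assumed to be available for the Gaussian densities) builds q_mix for a toy Gaussian mixture, places its mass on the n unit vectors, and verifies that summing q_mix(v|h) q_mix(h) over all h ∈ {0,1}^n reproduces p_mix(v).

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

# Toy Gaussian mixture p_mix over R^2 with n = 3 components of fixed variance sigma^2.
rng = np.random.default_rng(2)
means = rng.uniform(0.0, 1.0, size=(3, 2))
weights = np.array([0.2, 0.5, 0.3])
sigma2 = 0.05
n = len(weights)

def component(v, i):
    """q_mix(v | e_i) = p_mix(v | i): the i-th mixture component, a member of G_sigma."""
    return multivariate_normal.pdf(v, mean=means[i], cov=sigma2 * np.eye(2))

def q_mix(h):
    """Equation (5): q_mix(h) = p_mix(i) if h equals the unit vector e_i, and 0 otherwise."""
    h = np.asarray(h)
    return float(weights[int(np.argmax(h))]) if h.sum() == 1 else 0.0

def p_mix_via_latent(v):
    """p_mix(v) rewritten as sum_h q_mix(v|h) q_mix(h); only the n unit vectors
    contribute, the remaining 2^n - n terms carry weight zero."""
    return sum(component(v, int(np.argmax(h))) * q_mix(h)
               for h in itertools.product([0, 1], repeat=n) if q_mix(h) > 0.0)

v = np.array([0.3, 0.7])
direct = sum(w * component(v, i) for i, w in enumerate(weights))
print(np.isclose(p_mix_via_latent(v), direct))   # True: both expressions agree
```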

For a family G ⊆ I it is possible to construct a B-DBN with the properties required in Lemma 3 under weak technical assumptions. The assumptions hold for families of distributions used in practice, for instance Gaussian and truncated exponential distributions.

Lemma 4. Let G ⊆ I and p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) with

p_mix(v|i) = (1/Z_i) exp( Σ_{r=1}^k Φ_r(v)^T µ_r(θ_i) )   (6)

for i = 1, ..., n and corresponding parameters θ_1, ..., θ_n. Let the distribution q_mix(h) be defined by equation (5). Assume that there exist parameters b^r such that for all c ∈ R^n the joint distribution p(v, h) of v ∈ R^m and h ∈ {0,1}^n with energy

E(v, h) = − Σ_{r=1}^k Φ_r(v)^T ( W^r h + b^r ) − c^T h

is a proper distribution (i.e., the corresponding normalization constant is finite), where the i-th column of W^r is µ_r(θ_i) − b^r. Then the following holds: For all ε > 0 there exists a B-DBN with joint distribution p(v, h, ĥ) = p(v|h) p(h, ĥ) ∈ B-DBN(G) such that

(i) p_mix(v|i) = p(v|e_i) for i = 1, ..., n, and
(ii) ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε.

Proof. Property (i) follows from equation (4) by setting h = e_i and the i-th column of W^r to µ_r(θ_i) − b^r. Property (ii) follows directly from applying Theorem 2 to p(h).

For some families of distributions, such as truncated exponential or Gaussian distributions with uniform variance, choosing b^r = 0 for r = 1, ..., k is sufficient to yield a proper joint distribution p(v, h) and thus a B-DBN with the desired properties. If such a B-DBN exists, one can show, under weak additional assumptions on G ⊆ I, that the bound shown in Lemma 3 is finite. It follows that the bound decreases to zero as ε does.

Theorem 5. Let G ⊆ I be a family of densities and p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) with p_mix(v|i) given by equation (6). Furthermore, let q_mix(h) be given by equation (5) and let p(v, h, ĥ) ∈ B-DBN(G) with

(i) p_mix(v|i) = p(v|e_i) for i = 1, ..., n,
(ii) ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε,
(iii) ∀ h ∈ {0,1}^n: ∫ p(v|h) Σ_{r=1}^k ‖Φ_r(v)‖_1 dv < ∞.

Then B(G, p_mix, ε) is finite and thus in O(ε).

Proof. We have to show that under the conditions given above ∫ α(v) β(v) dv is finite. We first find an upper bound for β(v) = log(1 + α(v)/p_mix(v)) for an arbitrary but fixed v. Since p_mix(v) = Σ_i p_mix(v|i) p_mix(i) is a convex combination, by defining i* = arg min_i p_mix(v|i) and h* = arg max_h p(v|h) we get

α(v) / p_mix(v) = Σ_h p(v|h) / Σ_i p_mix(v|i) p_mix(i) ≤ 2^n p(v|h*) / p_mix(v|i*).   (7)

The conditional distribution p_mix(v|i) of the mixture can be written as in equation (6) and the conditional distribution p(v|h) of the RBM can be written as in formula (4). We define u^r(h) = W^r h + b^r and get

p(v|h*) / p_mix(v|i*) = exp( Σ_{r=1}^k Φ_r(v)^T u^r(h*) ) / exp( Σ_{r=1}^k Φ_r(v)^T µ_r(θ_{i*}) ) = exp( Σ_{r=1}^k Φ_r(v)^T [ u^r(h*) − µ_r(θ_{i*}) ] ) = exp( Σ_{r=1}^k Σ_{j=1}^m φ^r_j(v_j) [ u^r_j(h*) − µ^r_j(θ_{i*}) ] ).

Note that the last expression is always larger than or equal to one. We can further bound this term by defining ξ_r = max_{j,h,i} | u^r_j(h) − µ^r_j(θ_i) | and arrive at

p(v|h*) / p_mix(v|i*) ≤ exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ).   (8)

By plugging these results into the formula for β(v) we obtain

β(v) ≤ log( 1 + 2^n p(v|h*)/p_mix(v|i*) )   [by (7)]
     ≤ log( 1 + 2^n exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ) )   [by (8)]
     ≤ log( 2^{n+1} exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ) )
     = (n+1) log 2 + Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1.

In the third step we used that the second term is always larger than 1. Insertion into ∫ α(v) β(v) dv leads to

∫ α(v) β(v) dv ≤ ∫ α(v) [ (n+1) log 2 + Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ] dv = 2^n (n+1) log 2 + Σ_h Σ_{r=1}^k ξ_r ∫ p(v|h) ‖Φ_r(v)‖_1 dv,   (9)

which is finite by assumption.
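For the Gaussian case treated in the next subsection, the construction behind Lemma 4 is particularly simple: with b = 0, the i-th column of the first-layer weight matrix is (up to the 1/σ^2 scaling of the natural parameters) just the i-th mixture mean, so that p(v|e_i) is the i-th component. The following sketch (ours, not from the paper; the helper names are hypothetical) spells this out for the GB-RBM parameterization used in the proof of Corollary 6 below and checks property (i) numerically.

```python
import numpy as np
from scipy.stats import multivariate_normal

def first_layer_from_means(mixture_means):
    """First-layer parameters of a Gaussian-binary B-DBN following Lemma 4 with b = 0:
    under the GB-RBM energy E(v,h) = v^T v/(2 sigma^2) - v^T W h/sigma^2 - c^T h,
    the i-th column of W is the i-th mixture mean z^i, so p(v | h = e_i) = N(z^i, sigma^2 I)."""
    W = np.stack(mixture_means, axis=1)      # shape (m, n): column i holds z^i
    b = np.zeros(W.shape[0])
    return W, b

def conditional_density(v, h, W, b, sigma2):
    """p(v | h) of the Gaussian-binary RBM: a Gaussian with mean b + W h and variance sigma^2."""
    mean = b + W @ h
    return multivariate_normal.pdf(v, mean=mean, cov=sigma2 * np.eye(len(mean)))

# Toy mixture with n = 3 components in R^2, means in the unit square.
rng = np.random.default_rng(3)
z = rng.uniform(0.0, 1.0, size=(3, 2))
sigma2 = 0.05
W, b = first_layer_from_means(z)

v = np.array([0.4, 0.4])
e1 = np.array([1.0, 0.0, 0.0])               # unit vector selecting the first component
lhs = conditional_density(v, e1, W, b, sigma2)
rhs = multivariate_normal.pdf(v, mean=z[0], cov=sigma2 * np.eye(2))
print(np.isclose(lhs, rhs))                  # True: p(v | e_1) equals the first component
```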

3.2. Finite Gaussian mixtures

Now we apply Lemma 4 and Theorem 5 to mixtures of Gaussian distributions with uniform variance. The KL-divergence is continuous for strictly positive distributions. Our previous results thus imply that for every mixture p_mix of Gaussian distributions with uniform variance and every δ > 0 we can find a B-DBN p such that KL(p ‖ p_mix) ≤ δ. The following corollary gives a corresponding bound:

Corollary 6. Let Ω = R^m and let G_σ be the family of Gaussian distributions with variance σ^2. Let ε > 0 and let p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G_σ) be a mixture of n distributions with means z^i ∈ R^m, i = 1, ..., n. By D = max_{r,s ∈ {1,...,n}, k ∈ {1,...,m}} | z^r_k − z^s_k | we denote the edge length of the smallest hypercube containing all means. Then there exists p(v, h, ĥ) ∈ B-DBN(G_σ), with ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε and p_mix(v|i) = p(v|e_i), i = 1, ..., n, such that

KL(p ‖ p_mix) ≤ ε ( 2^n (n+1) log 2 + m 2^n ( n^2/(σ/D)^2 + n √(2/π)/(σ/D) ) ) + 2^n (1+ε) log(1+ε).

Proof. In a first step we apply an affine linear transformation to map the hypercube of edge length D to the unit hypercube [0,1]^m. Doing this while transforming the B-DBN distribution accordingly does not change the KL-divergence, but it does change the standard deviation of the Gaussians from σ to σ/D. In other words, it suffices to show the above bound for D = 1 and z^i ∈ [0,1]^m. The energy of the Gaussian-binary RBM p(v, h) is typically written as

E(v, h) = (1/(2σ^2)) v^T v − (1/σ^2) v^T b − (1/σ^2) v^T W h − c^T h,

with weight matrix W and bias vectors b and c. This can be brought into the form of formula (3) by setting k = 2, φ^1_j(v_j) = v_j, φ^2_j(v_j) = v_j^2, W^1 = W/σ^2, W^2 = 0, b^1_j = b_j/σ^2, and b^2_j = −1/(2σ^2). With b = 0 and thus b^1 = 0, it follows from Lemma 4 that a B-DBN p(v, h, ĥ) = p(v|h) p(h, ĥ) with properties (i) and (ii) from Theorem 5 exists. It remains to show that property (iii) holds. Since the conditional probability factorizes, it suffices to show that (iii) holds for every visible variable individually. The conditional probability of the j-th visible neuron of the constructed B-DBN is given by

p(v_j|h) = (1/√(2πσ^2)) exp( −(v_j − z_j(h))^2 / (2σ^2) ),

where the mean z_j(h) is the j-th element of W h. Using this, it is easy to see that

∫ p(v_j|h) |φ^2_j(v_j)| dv_j = ∫ p(v_j|h) v_j^2 dv_j < ∞,

because it is the second moment of the normal distribution.

For ∫ p(v_j|h) |φ^1_j(v_j)| dv_j we get

∫ p(v_j|h) |v_j| dv_j = (1/√(2πσ^2)) ∫ |v_j| exp( −(v_j − z_j(h))^2/(2σ^2) ) dv_j = (1/√(2πσ^2)) ∫ |t + z_j(h)| exp( −t^2/(2σ^2) ) dt ≤ z_j(h) + (1/√(2πσ^2)) ∫ |t| exp( −t^2/(2σ^2) ) dt = z_j(h) + σ √(2/π) ≤ n + σ √(2/π),   (10)

where we substituted t = v_j − z_j(h). In the last step we used that z_j(e_i) = z^i_j ∈ [0,1] by construction and thus z_j(h) can be bounded from above by

z_j(h) = Σ_{i=1}^n h_i z_j(e_i) ≤ n.   (11)

Thus it follows from Theorem 5 that the bound from Lemma 3 holds and is finite. To get the actual bound, we only need to find the constants ξ_1 and ξ_2 to be inserted into (9). The first constant is given by ξ_1 = max_{j,h,i} | z_j(h)/σ^2 − z^i_j/σ^2 |. It can be upper bounded by max_{j,h} z_j(h)/σ^2 ≤ n/σ^2, as an application of equation (11) shows. The second constant is given by ξ_2 = max_{j,h,i} | 1/(2σ^2) − 1/(2σ^2) | = 0. Inserting these constants into inequality (9) leads to the bound.

The bound B(G_σ, p_mix, ε) is also finite when Ω is restricted to a compact subset of R^m. This can easily be verified by adapting equation (10) accordingly. Similar results can be obtained for other families of distributions. A prominent example are B-DBNs with truncated exponential distributions. In this case the energy function of the first layer is the same as for the binary RBM, but the values of the visible neurons are chosen from the interval [0,1] instead of {0,1}. It is easy to see that for every choice of parameters the normalization constant as well as the bound are finite.

3.3. Infinite mixtures

We now transfer our results for finite mixtures to the case of infinite mixtures, following Li & Barron (2000).

Theorem 7. Let G be a family of continuous distributions and f ∈ CONV(G) such that the bound from Theorem 1 is finite for all p_mix-n ∈ MIX(n, G), n ∈ N. Furthermore, for all p_mix-n ∈ MIX(n, G), n ∈ N, and for all ε̂ > 0 let there exist a B-DBN in B-DBN(G) such that B(G, p_mix-n, ε̂) is finite. Then for all ε > 0 there exists p(v, h, ĥ) ∈ B-DBN(G) with KL(f ‖ p) ≤ ε.

Proof. From Theorem 1 and the assumption that the corresponding bound is finite it follows that for all ε > 0 there exists a mixture p_mix-n ∈ MIX(n, G) with n ≥ 2 c_f^2 γ / ε such that KL(f ‖ p_mix-n) ≤ ε/2. By assumption there exists a B-DBN ∈ B-DBN(G) such that B(G, p_mix-n, ε̂) is finite. Thus, one can define a sequence of B-DBNs (p_ε̂)_ε̂ ⊂ B-DBN(G) with ε̂ decaying to zero, where the B-DBNs only differ in the weights between the hidden layers, for which it holds that KL(p_ε̂ ‖ p_mix-n) → 0 as ε̂ → 0. This implies that p_ε̂ → p_mix-n uniformly as ε̂ → 0. It follows that KL(f ‖ p_ε̂) → KL(f ‖ p_mix-n) as ε̂ → 0. Thus, there exists ε̂ such that | KL(f ‖ p_ε̂) − KL(f ‖ p_mix-n) | < ε/2. A combination of these inequalities yields

KL(f ‖ p_ε̂) ≤ | KL(f ‖ p_ε̂) − KL(f ‖ p_mix-n) | + KL(f ‖ p_mix-n) ≤ ε.

This result applies to infinite mixtures of Gaussians with the same fixed but arbitrary variance σ^2 in all components. In the limit σ^2 → 0 such mixtures can approximate strictly positive densities over compact sets arbitrarily well (Zeevi & Meir, 1997).

4. Conclusions

We presented a step towards understanding the representational power of DBNs for modeling real-valued data. When binary latent variables are considered, DBNs with two hidden layers can already achieve good approximation results. Under mild constraints, we showed that for modeling a mixture of n pairwise independent distributions, a DBN with only 2n + 1 binary hidden units is sufficient to make the KL-divergence between the mixture p_mix and the DBN distribution p arbitrarily small (i.e., for every δ > 0 we can find a DBN such that KL(p ‖ p_mix) < δ).
This holds for deep architectures used in practice, for instance DBNs having visible neurons with Gaussian or truncated exponential conditional distributions, and corresponding mixture distributions having components of the same type as the visible units of the DBN. Furthermore, we extended these results to infinite mixtures and showed that these can be approximated arbitrarily well by a DBN with a finite number of neurons.

Therefore, Gaussian-binary DBNs inherit the universal approximation properties of additive Gaussian mixtures, which can model any strictly positive density over a compact domain with arbitrarily high accuracy.

Acknowledgments

OK and CI acknowledge support from the Danish National Advanced Technology Foundation through the project "Personalized breast cancer screening". AF and CI acknowledge support from the German Federal Ministry of Education and Research within the Bernstein Fokus "Learning behavioral models: From human experiment to technical assistance".

References

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Hinton, Geoffrey E. and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Le Roux, Nicolas and Bengio, Yoshua. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631-1649, 2008.

Le Roux, Nicolas and Bengio, Yoshua. Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192-2207, 2010.

Le Roux, Nicolas, Heess, Nicolas, Shotton, Jamie, and Winn, John M. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3), 2011.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 609-616. ACM, 2009.

Li, Jonathan Q. and Barron, Andrew R. Mixture density estimation. In Solla, Sara A., Leen, Todd K., and Müller, Klaus-Robert (eds.), Advances in Neural Information Processing Systems 12 (NIPS 1999). MIT Press, 2000.

Montufar, Guido and Ay, Nihat. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5), 2011.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), volume 2, 2007.

Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, David E. and McClelland, James L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194-281. MIT Press, 1986.

Taylor, Graham W., Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatio-temporal features. In Computer Vision - ECCV 2010, volume 6316 of LNCS. Springer, 2010.

Wang, Nan, Melchior, Jan, and Wiskott, Laurenz. An analysis of Gaussian-binary restricted Boltzmann machines for natural images. In Verleysen, Michel (ed.), Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012). Evere, Belgium: d-side publications, 2012.

Welling, Max, Rosen-Zvi, Michal, and Hinton, Geoffrey E. Exponential family harmoniums with an application to information retrieval. In Saul, Lawrence K., Weiss, Yair, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems 17 (NIPS 2004). MIT Press, 2005.

Zeevi, Assaf J. and Meir, Ronny. Density estimation through convex combinations of densities: Approximation and estimation bounds. Neural Networks, 10(1):99-109, 1997.
