Approximation properties of DBNs with binary hidden units and real-valued visible units


Oswin Krause, Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
Asja Fischer, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany, and Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
Tobias Glasmachers, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
Christian Igel, Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark

Abstract

Deep belief networks (DBNs) can approximate any distribution over fixed-length binary vectors. However, DBNs are frequently applied to model real-valued data, and so far little is known about their representational power in this case. We analyze the approximation properties of DBNs with two layers of binary hidden units and visible units with conditional distributions from the exponential family. It is shown that these DBNs can, under mild assumptions, model any additive mixture of distributions from the exponential family with independent variables. An arbitrarily good approximation in terms of Kullback-Leibler divergence of an m-dimensional mixture distribution with n components can be achieved by a DBN with m visible variables and n and n+1 hidden variables in the first and second hidden layer, respectively. Furthermore, relevant infinite mixtures can be approximated arbitrarily well by a DBN with a finite number of neurons. This includes the important special case of an infinite mixture of Gaussian distributions with fixed variance restricted to a compact domain, which in turn can approximate any strictly positive density over this domain.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the authors.

1. Introduction

Restricted Boltzmann machines (RBMs; Smolensky, 1986; Hinton, 2002) and deep belief networks (DBNs; Hinton et al., 2006; Hinton & Salakhutdinov, 2006) are probabilistic models with latent and observable variables, which can be interpreted as stochastic neural networks. Binary RBMs, in which each variable conditioned on the others is Bernoulli distributed, can approximate any distribution over the observable variables arbitrarily well (Le Roux & Bengio, 2008; Montufar & Ay, 2011). Binary deep belief networks are built by layering binary RBMs, and their representational power does not decrease when layers are added (Le Roux & Bengio, 2008; Montufar & Ay, 2011). In fact, it can be shown that a binary DBN never needs more variables than a binary RBM to model a distribution with a certain accuracy (Le Roux & Bengio, 2010). However, arguably the most prominent recent applications involving RBMs consider models in which the visible variables are real-valued (e.g., Salakhutdinov & Hinton, 2007; Lee et al., 2009; Taylor et al., 2010; Le Roux et al., 2011). Welling et al. (2005) proposed a notion of RBMs in which the conditional distributions of the observable variables given the latent variables, and vice versa, are chosen almost arbitrarily from the exponential family. This includes the important special case of the Gaussian-binary RBM (GB-RBM, also called Gaussian-Bernoulli RBM), an RBM with binary hidden and Gaussian visible variables. Despite their frequent use, little is known about the approximation capabilities of RBMs and DBNs modeling continuous distributions.

Clearly, orchestrating a set of Bernoulli distributions to model a distribution over binary vectors is easy compared to approximating distributions over R^m. Recently, Wang et al. (2012) have emphasized that the distribution of the visible variables represented by a GB-RBM with n hidden units is a mixture of 2^n Gaussian distributions with means lying on the vertices of a projected n-dimensional hyperparallelotope. This limited flexibility makes modeling even a mixture of a finite number of Gaussian distributions with a GB-RBM difficult.

This work is a first step towards understanding the representational power of DBNs with binary latent and real-valued visible variables. We show, for a subset of distributions relevant in practice, that DBNs with two layers of binary hidden units and a fixed family of conditional distributions for the visible units can model finite mixtures of that family arbitrarily well. As this also holds for infinite mixtures of Gaussians with fixed variance restricted to a compact domain, our results imply universal approximation of strictly positive densities over compact sets.

2. Background

This section recalls basic results on the approximation properties of mixture distributions and binary RBMs. Furthermore, the considered models are defined.

2.1. Mixture distributions

A mixture distribution p_mix(v) over Ω is a convex combination of simpler distributions which are members of some family G of distributions over Ω parameterized by θ ∈ Θ. We define

MIX(n, G) = { Σ_{i=1}^n p_mix(v|i) p_mix(i) | Σ_{i=1}^n p_mix(i) = 1 and ∀ i ∈ {1, ..., n}: p_mix(i) ≥ 0 ∧ p_mix(v|i) ∈ G }

as the family of mixtures of n distributions from G. Furthermore, we denote the family of infinite mixtures of distributions from G by

CONV(G) = { ∫_Θ p(v|θ) p(θ) dθ | ∫_Θ p(θ) dθ = 1 and ∀ θ ∈ Θ: p(θ) ≥ 0 ∧ p(v|θ) ∈ G }.

Li & Barron have shown that, for some family of distributions G, every element from CONV(G) can be approximated arbitrarily well by finite mixtures with respect to the Kullback-Leibler divergence (KL-divergence):

Theorem 1 (Li & Barron, 2000). Let f ∈ CONV(G). There exists a finite mixture p_mix ∈ MIX(n, G) such that

KL(f ‖ p_mix) ≤ c_f^2 γ / n,

where c_f^2 = ∫ ( ∫ f(v|θ)^2 f(θ) dθ ) / ( ∫ f(v|θ) f(θ) dθ ) dv and γ = 4 [ log(3√e) + a ] with a = sup_{θ_1, θ_2, v} log( f(v|θ_1) / f(v|θ_2) ).

The bound is not necessarily finite. However, it follows from previous results by Zeevi & Meir (1997) that for every f and every ε > 0 there exists a mixture p_mix with n components such that KL(f ‖ p_mix) ≤ ε + c/n for some constant c, if Ω ⊂ R^m is a compact set and f is continuous and bounded from below by some η > 0 (i.e., ∀ x ∈ Ω: f(x) ≥ η > 0). Furthermore, it follows that for compact Ω ⊂ R^m every continuous density f on Ω can be approximated arbitrarily well by an infinite but countable mixture of Gaussian distributions with fixed variance σ^2 and means restricted to Ω, that is, by a mixture of distributions from the family

G_σ = { p(x) = (2πσ^2)^{-m/2} exp( -‖x − µ‖^2 / (2σ^2) ) | µ ∈ Ω }   (1)

for sufficiently small σ.
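To make the objects in MIX(n, G_σ) concrete, the following short Python sketch evaluates the density of a finite mixture of isotropic Gaussians with fixed variance σ^2, i.e., exactly the convex combination defining MIX(n, G_σ) above. The sketch is our own illustration and not part of the original paper; all function names are ours.

```python
import numpy as np

def gaussian_density(x, mu, sigma2):
    """Isotropic Gaussian with mean mu and fixed variance sigma2, i.e. a member of G_sigma."""
    m = x.shape[-1]
    sq_dist = np.sum((x - mu) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (m / 2.0)

def mixture_density(x, means, weights, sigma2):
    """An element of MIX(n, G_sigma): a convex combination of n Gaussian components."""
    components = np.array([gaussian_density(x, mu, sigma2) for mu in means])
    return float(np.dot(weights, components))

# Toy example: a 3-component mixture over R^2 with means in the unit square.
rng = np.random.default_rng(0)
means = rng.uniform(0.0, 1.0, size=(3, 2))
weights = np.array([0.5, 0.3, 0.2])          # nonnegative and summing to one
print(mixture_density(np.array([0.4, 0.6]), means, weights, sigma2=0.05))
```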
2.2. Restricted Boltzmann Machines

An RBM is an undirected graphical model with a bipartite structure (Smolensky, 1986; Hinton, 2002) consisting of one layer of m visible variables V = (V_1, ..., V_m) ∈ Ω and one layer of n hidden variables H = (H_1, ..., H_n) ∈ Λ. The modeled joint distribution is a Gibbs distribution p(v, h) = (1/Z) e^{−E(v,h)} with energy E and normalization constant Z = ∫_Ω ∫_Λ e^{−E(v,h)} dh dv, where the variables of one layer are mutually independent given the state of the other layer.

2.2.1. Binary-Binary RBMs

In the standard binary RBM the state spaces of the variables are Ω = {0,1}^m and Λ = {0,1}^n. The energy is given by

E(v, h) = −v^T W h − v^T b − c^T h

with weight matrix W and bias vectors b and c. Le Roux & Bengio showed that binary RBMs are universal approximators for distributions over binary vectors:

Theorem 2 (Le Roux & Bengio, 2008). Any distribution over Ω = {0,1}^m can be approximated arbitrarily well with respect to the KL-divergence by an RBM with k+1 hidden units, where k is the number of input vectors whose probability is not zero.

The number of hidden neurons required can be reduced to the minimum number of pairs of input vectors differing in only one component with the property that their union contains all observable patterns having positive probability (Montufar & Ay, 2011).
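As a small, self-contained illustration of this setup (again our own sketch, not code from the paper; the names are ours), the following Python snippet computes the visible marginal p(v) of a tiny binary RBM by brute force, directly from the Gibbs distribution p(v,h) ∝ e^{−E(v,h)} with the energy above. This is only feasible for very small m and n, but it makes the objects appearing in Theorem 2 explicit.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """Binary RBM energy E(v, h) = -v^T W h - v^T b - c^T h."""
    return -(v @ W @ h + v @ b + c @ h)

def visible_marginal(W, b, c):
    """Exact p(v) of a small binary RBM, obtained by brute-force summation
    of the Gibbs distribution p(v, h) = exp(-E(v, h)) / Z over all states."""
    m, n = W.shape
    states_v = list(itertools.product([0, 1], repeat=m))
    states_h = list(itertools.product([0, 1], repeat=n))
    joint = np.array([[np.exp(-energy(np.array(v), np.array(h), W, b, c))
                       for h in states_h] for v in states_v])
    joint /= joint.sum()                     # divide by the partition function Z
    return {v: joint[i].sum() for i, v in enumerate(states_v)}

# Toy RBM with m = 2 visible and n = 3 hidden units.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(2, 3)), rng.normal(size=2), rng.normal(size=3)
print(visible_marginal(W, b, c))
```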

2.2.2. Exponential-Family RBMs

Welling et al. (2005) introduced a framework for constructing generalized RBMs called exponential family harmoniums. In this framework, the conditional distributions p(h_i|v) and p(v_j|h), i = 1, ..., n, j = 1, ..., m, belong to the exponential family. Almost all types of RBMs encountered in practice, including binary RBMs, can be interpreted as exponential family harmoniums. The exponential family is the class F of probability distributions that can be written in the form

p(x) = (1/Z) exp( Σ_{r=1}^k Φ_r(x)^T µ_r(θ) ),   (2)

where θ are the parameters of the distribution and Z is the normalization constant. (By setting k = 1 and rewriting Φ and µ accordingly, one obtains the standard formulation.) The functions Φ_r and µ_r, for r = 1, ..., k, transform the sample space and the distribution parameters, respectively. Let I be the subset of F in which the components of x = (x_1, ..., x_m) are independent of each other, that is, I = { p ∈ F | ∀ x: p(x_1, ..., x_m) = p(x_1) p(x_2) ··· p(x_m) }. For elements of I the function Φ_r can be written as Φ_r(x) = (φ^r_1(x_1), ..., φ^r_m(x_m)). A prominent subset of I is the family of Gaussian distributions with fixed variance σ^2, G_σ ⊂ I, see equation (1).

Following Welling et al., the energy of an RBM with binary hidden units and visible units with p(v|h) ∈ I is given by

E(v, h) = − Σ_{r=1}^k Φ_r(v)^T W^r h − Σ_{r=1}^k Φ_r(v)^T b^r − c^T h,   (3)

where Φ_r(v) = (φ^r_1(v_1), ..., φ^r_m(v_m)). Note that not every possible choice of parameters necessarily leads to a finite normalization constant and thus to a proper distribution. If the joint distribution is properly defined, the conditional probability of the visible units given the hidden units is

p(v|h) = (1/Z_h) exp( Σ_{r=1}^k Φ_r(v)^T ( W^r h + b^r ) ),   (4)

where Z_h is the corresponding normalization constant. Thus, the marginal distribution of the visible units p(v) can be expressed as a mixture of 2^n conditional distributions:

p(v) = Σ_{h ∈ {0,1}^n} p(v|h) p(h) ∈ MIX(2^n, I).

2.3. Deep Belief Networks

A DBN is a graphical model with more than two layers built by stacking RBMs (Hinton et al., 2006; Hinton & Salakhutdinov, 2006). A DBN with two layers of hidden variables H and Ĥ and a visible layer V is characterized by a probability distribution p(v, h, ĥ) that fulfills p(v, h, ĥ) = p(v|h) p(h, ĥ) = p(ĥ|h) p(v, h). In this study we are interested in the approximation properties of DBNs with two binary hidden layers and real-valued visible neurons. We refer to such a DBN as a B-DBN. With B-DBN(G) we denote the family of all B-DBNs having conditional distributions p(v|h) ∈ G for all h ∈ {0,1}^n.

3. Approximation properties

This section presents our results on the approximation properties of DBNs with binary hidden units and real-valued visible units. It consists of the following steps: Lemma 3 gives an upper bound on the KL-divergence between a B-DBN and a finite additive mixture model, however under the assumption that the B-DBN contains the mixture components. For mixture models from a subset of I, Lemma 4 and Theorem 5 show that such B-DBNs actually exist and that the KL-divergence can be made arbitrarily small. Corollary 6 specializes the previous theorem to the case of Gaussian mixtures, showing how the bound can be applied to distributions used in practice. Finally, Theorem 7 generalizes the results to infinite mixture distributions, and thus to the approximation of arbitrary strictly positive densities on a compact set Ω.
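Before turning to the results, the fact from Section 2.2.2 that the visible marginal of an exponential-family RBM is a finite mixture is easy to make concrete for the Gaussian-binary case. The sketch below (our own illustration, not from the paper) enumerates the 2^n component means b + W h of a GB-RBM, whose conditional p(v|h) is Gaussian with mean b + W h under the usual GB-RBM energy written out in the proof of Corollary 6 below; the mixture weights p(h) are determined by the model but are not computed here.

```python
import itertools
import numpy as np

def gb_rbm_component_means(W, b):
    """Means of the 2^n Gaussian components of the visible marginal of a
    Gaussian-binary RBM: p(v|h) is Gaussian with mean b + W h, so the means
    lie on {b + W h | h in {0,1}^n}, the vertices of a projected
    n-dimensional parallelotope (cf. Wang et al., 2012)."""
    n = W.shape[1]
    return np.array([b + W @ np.array(h, dtype=float)
                     for h in itertools.product([0, 1], repeat=n)])

# Toy example: m = 2 visible and n = 3 hidden units give 8 component means.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)
print(gb_rbm_component_means(W, b))
```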

3.1. Finite mixtures

We first introduce a construction that will enable us to model mixtures of distributions by DBNs. For some family G, an arbitrary mixture of distributions p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) over v can be expressed in terms of a joint probability distribution of v and h ∈ {0,1}^n by defining the distribution

q_mix(h) = p_mix(i) if h = e_i, and q_mix(h) = 0 else,   (5)

over {0,1}^n, where e_i is the i-th unit vector. Then we can rewrite p_mix(v) as p_mix(v) = Σ_h q_mix(v|h) q_mix(h), where q_mix(v|h) ∈ G for all h ∈ {0,1}^n and q_mix(v|e_i) = p_mix(v|i) for all i = 1, ..., n. This can be interpreted as expressing p_mix(v) as an element of MIX(2^n, G) with 2^n − n mixture components having a probability or weight equal to zero.

Now we can model p_mix(v) by the marginal distribution of the visible variables p(v) = Σ_{h,ĥ} p(v|h) p(h, ĥ) = Σ_h p(v|h) p(h) of a B-DBN p(v, h, ĥ) ∈ B-DBN(G) with the following properties:

1. p(v|e_i) = p_mix(v|i) for i = 1, ..., n, and
2. p(h) = Σ_ĥ p(h, ĥ) approximates q_mix(h).

Following this line of thought we can formulate our first result. It provides an upper bound on the KL-divergence between any element from MIX(n, G) and the marginal distribution of the visible variables of a B-DBN with the properties stated above, where p(h) models q_mix(h) with an approximation error smaller than a given ε.

Lemma 3. Let p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) be a mixture of n components from a family of distributions G, and let q_mix(h) be defined as in (5). Let p(v, h, ĥ) ∈ B-DBN(G) with the properties p(v|e_i) = p_mix(v|i) for i = 1, ..., n and ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε for some ε > 0. Then the KL-divergence between p_mix and p is bounded by

KL(p ‖ p_mix) ≤ B(G, p_mix, ε),

where B(G, p_mix, ε) = ε ∫ α(v) β(v) dv + 2^n (1+ε) log(1+ε) with α(v) = Σ_h p(v|h) and β(v) = log( 1 + α(v)/p_mix(v) ).

Proof. Using |p(h) − q_mix(h)| < ε for all h ∈ {0,1}^n and p_mix(v) = Σ_h p(v|h) q_mix(h), we can write

p(v) = Σ_h p(v|h) p(h) = Σ_h p(v|h) ( q_mix(h) + p(h) − q_mix(h) ) = p_mix(v) + Σ_h p(v|h) ( p(h) − q_mix(h) ) ≤ p_mix(v) + α(v) ε,

where α(v) is defined as above. Thus, we get for the KL-divergence

KL(p ‖ p_mix) = ∫ p(v) log( p(v)/p_mix(v) ) dv ≤ ∫ ( p_mix(v) + α(v) ε ) log( ( p_mix(v) + α(v) ε ) / p_mix(v) ) dv =: ∫ F(ε, v) dv = ∫ ∫_0^ε (∂/∂ε̃) F(ε̃, v) dε̃ dv,

using F(0, v) = 0. Because 1 + xε ≤ (1 + x)(1 + ε) for all x, ε ≥ 0, we can upper bound ∂F(ε, v)/∂ε by

∂F(ε, v)/∂ε = α(v) [ 1 + log( 1 + α(v) ε / p_mix(v) ) ] ≤ α(v) [ 1 + log( (1 + α(v)/p_mix(v)) (1 + ε) ) ] = α(v) [ 1 + β(v) + log(1 + ε) ]

with β(v) as defined above. By integration we get

F(ε, v) = ∫_0^ε ∂F(ε̃, v)/∂ε̃ dε̃ ≤ α(v) β(v) ε + α(v) (1 + ε) log(1 + ε).

Integration with respect to v completes the proof.

The proof does not use the independence properties of p(v|h). Thus, it is possible to apply this bound also to mixture distributions which do not have conditionally independent variables. However, in this case one has to show that a generalization of the B-DBN exists which can model the target distribution, as the formalism introduced in formula (3) does not cover distributions which are not in I.
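The construction in (5) is easy to check numerically. The sketch below (an illustration of ours, not code from the paper; scipy is assumed to be available for the Gaussian densities) builds q_mix for a toy Gaussian mixture, places its mass on the n unit vectors, and verifies that summing q_mix(v|h) q_mix(h) over all h ∈ {0,1}^n reproduces p_mix(v).

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

# Toy Gaussian mixture p_mix over R^2 with n = 3 components of fixed variance sigma^2.
rng = np.random.default_rng(2)
means = rng.uniform(0.0, 1.0, size=(3, 2))
weights = np.array([0.2, 0.5, 0.3])
sigma2 = 0.05
n = len(weights)

def component(v, i):
    """q_mix(v | e_i) = p_mix(v | i): the i-th mixture component, a member of G_sigma."""
    return multivariate_normal.pdf(v, mean=means[i], cov=sigma2 * np.eye(2))

def q_mix(h):
    """Equation (5): q_mix(h) = p_mix(i) if h equals the unit vector e_i, and 0 otherwise."""
    h = np.asarray(h)
    return float(weights[int(np.argmax(h))]) if h.sum() == 1 else 0.0

def p_mix_via_latent(v):
    """p_mix(v) rewritten as sum_h q_mix(v|h) q_mix(h); only the n unit vectors
    contribute, the remaining 2^n - n terms carry weight zero."""
    return sum(component(v, int(np.argmax(h))) * q_mix(h)
               for h in itertools.product([0, 1], repeat=n) if q_mix(h) > 0.0)

v = np.array([0.3, 0.7])
direct = sum(w * component(v, i) for i, w in enumerate(weights))
print(np.isclose(p_mix_via_latent(v), direct))   # True: both expressions agree
```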

For a family G ⊆ I it is possible to construct a B-DBN with the properties required in Lemma 3 under weak technical assumptions. The assumptions hold for families of distributions used in practice, for instance Gaussian and truncated exponential distributions.

Lemma 4. Let G ⊆ I and p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) with

p_mix(v|i) = (1/Z_i) exp( Σ_{r=1}^k Φ_r(v)^T µ_r(θ_i) )   (6)

for i = 1, ..., n and corresponding parameters θ_1, ..., θ_n. Let the distribution q_mix(h) be defined by equation (5). Assume that there exist parameters b^r such that for all c ∈ R^n the joint distribution p(v, h) of v ∈ R^m and h ∈ {0,1}^n with energy

E(v, h) = − Σ_{r=1}^k Φ_r(v)^T ( W^r h + b^r ) − c^T h

is a proper distribution (i.e., the corresponding normalization constant is finite), where the i-th column of W^r is µ_r(θ_i) − b^r. Then the following holds: For all ε > 0 there exists a B-DBN with joint distribution p(v, h, ĥ) = p(v|h) p(h, ĥ) ∈ B-DBN(G) such that

(i) p_mix(v|i) = p(v|e_i) for i = 1, ..., n, and
(ii) ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε.

Proof. Property (i) follows from equation (4) by setting h = e_i and the i-th column of W^r to µ_r(θ_i) − b^r. Property (ii) follows directly from applying Theorem 2 to p(h).

For some families of distributions, such as truncated exponential or Gaussian distributions with uniform variance, choosing b^r = 0 for r = 1, ..., k is sufficient to yield a proper joint distribution p(v, h) and thus a B-DBN with the desired properties. If such a B-DBN exists, one can show, under weak additional assumptions on G ⊆ I, that the bound shown in Lemma 3 is finite. It follows that the bound decreases to zero as ε does.

Theorem 5. Let G ⊆ I be a family of densities and p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G) with p_mix(v|i) given by equation (6). Furthermore, let q_mix(h) be given by equation (5) and let p(v, h, ĥ) ∈ B-DBN(G) with

(i) p_mix(v|i) = p(v|e_i) for i = 1, ..., n,
(ii) ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε,
(iii) ∀ h ∈ {0,1}^n: ∫ p(v|h) Σ_{r=1}^k ‖Φ_r(v)‖_1 dv < ∞.

Then B(G, p_mix, ε) is finite and thus in O(ε).

Proof. We have to show that under the conditions given above ∫ α(v) β(v) dv is finite. We first find an upper bound for β(v) = log(1 + α(v)/p_mix(v)) for an arbitrary but fixed v. Since p_mix(v) = Σ_i p_mix(v|i) p_mix(i) is a convex combination, by defining i* = arg min_i p_mix(v|i) and h* = arg max_h p(v|h) we get

α(v) / p_mix(v) = Σ_h p(v|h) / Σ_i p_mix(v|i) p_mix(i) ≤ 2^n p(v|h*) / p_mix(v|i*).   (7)

The conditional distribution p_mix(v|i) of the mixture can be written as in equation (6) and the conditional distribution p(v|h) of the RBM can be written as in formula (4). We define u^r(h) = W^r h + b^r and get

p(v|h*) / p_mix(v|i*) = exp( Σ_{r=1}^k Φ_r(v)^T u^r(h*) ) / exp( Σ_{r=1}^k Φ_r(v)^T µ_r(θ_{i*}) ) = exp( Σ_{r=1}^k Φ_r(v)^T [ u^r(h*) − µ_r(θ_{i*}) ] ) = exp( Σ_{r=1}^k Σ_{j=1}^m φ^r_j(v_j) [ u^r_j(h*) − µ^r_j(θ_{i*}) ] ).

Note that the last expression is always larger than or equal to one. We can further bound this term by defining ξ_r = max_{j,h,i} | u^r_j(h) − µ^r_j(θ_i) | and arrive at

p(v|h*) / p_mix(v|i*) ≤ exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ).   (8)

By plugging these results into the formula for β(v) we obtain

β(v) ≤ log( 1 + 2^n p(v|h*)/p_mix(v|i*) )   [by (7)]
     ≤ log( 1 + 2^n exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ) )   [by (8)]
     ≤ log( 2^{n+1} exp( Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ) )
     = (n+1) log 2 + Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1.

In the third step we used that the second term is always larger than 1. Insertion into ∫ α(v) β(v) dv leads to

∫ α(v) β(v) dv ≤ ∫ α(v) [ (n+1) log 2 + Σ_{r=1}^k ξ_r ‖Φ_r(v)‖_1 ] dv = 2^n (n+1) log 2 + Σ_h Σ_{r=1}^k ξ_r ∫ p(v|h) ‖Φ_r(v)‖_1 dv,   (9)

which is finite by assumption.
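For the Gaussian case treated in the next subsection, the construction behind Lemma 4 is particularly simple: with b = 0, the i-th column of the first-layer weight matrix is (up to the 1/σ^2 scaling of the natural parameters) just the i-th mixture mean, so that p(v|e_i) is the i-th component. The following sketch (ours, not from the paper; the helper names are hypothetical) spells this out for the GB-RBM parameterization used in the proof of Corollary 6 below and checks property (i) numerically.

```python
import numpy as np
from scipy.stats import multivariate_normal

def first_layer_from_means(mixture_means):
    """First-layer parameters of a Gaussian-binary B-DBN following Lemma 4 with b = 0:
    under the GB-RBM energy E(v,h) = v^T v/(2 sigma^2) - v^T W h/sigma^2 - c^T h,
    the i-th column of W is the i-th mixture mean z^i, so p(v | h = e_i) = N(z^i, sigma^2 I)."""
    W = np.stack(mixture_means, axis=1)      # shape (m, n): column i holds z^i
    b = np.zeros(W.shape[0])
    return W, b

def conditional_density(v, h, W, b, sigma2):
    """p(v | h) of the Gaussian-binary RBM: a Gaussian with mean b + W h and variance sigma^2."""
    mean = b + W @ h
    return multivariate_normal.pdf(v, mean=mean, cov=sigma2 * np.eye(len(mean)))

# Toy mixture with n = 3 components in R^2, means in the unit square.
rng = np.random.default_rng(3)
z = rng.uniform(0.0, 1.0, size=(3, 2))
sigma2 = 0.05
W, b = first_layer_from_means(z)

v = np.array([0.4, 0.4])
e1 = np.array([1.0, 0.0, 0.0])               # unit vector selecting the first component
lhs = conditional_density(v, e1, W, b, sigma2)
rhs = multivariate_normal.pdf(v, mean=z[0], cov=sigma2 * np.eye(2))
print(np.isclose(lhs, rhs))                  # True: p(v | e_1) equals the first component
```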

3.2. Finite Gaussian mixtures

Now we apply Lemma 4 and Theorem 5 to mixtures of Gaussian distributions with uniform variance. The KL-divergence is continuous for strictly positive distributions. Our previous results thus imply that for every mixture p_mix of Gaussian distributions with uniform variance and every δ > 0 we can find a B-DBN p such that KL(p ‖ p_mix) ≤ δ. The following corollary gives a corresponding bound:

Corollary 6. Let Ω = R^m and let G_σ be the family of Gaussian distributions with variance σ^2. Let ε > 0 and let p_mix(v) = Σ_{i=1}^n p_mix(v|i) p_mix(i) ∈ MIX(n, G_σ) be a mixture of n distributions with means z^i ∈ R^m, i = 1, ..., n. By D = max_{r,s ∈ {1,...,n}, k ∈ {1,...,m}} | z^r_k − z^s_k | we denote the edge length of the smallest hypercube containing all means. Then there exists p(v, h, ĥ) ∈ B-DBN(G_σ), with ∀ h ∈ {0,1}^n: |p(h) − q_mix(h)| < ε and p_mix(v|i) = p(v|e_i), i = 1, ..., n, such that

KL(p ‖ p_mix) ≤ ε ( 2^n (n+1) log 2 + m 2^n ( n^2/(σ/D)^2 + n √(2/π)/(σ/D) ) ) + 2^n (1+ε) log(1+ε).

Proof. In a first step we apply an affine linear transformation to map the hypercube of edge length D to the unit hypercube [0,1]^m. Doing this while transforming the B-DBN distribution accordingly does not change the KL-divergence, but it does change the standard deviation of the Gaussians from σ to σ/D. In other words, it suffices to show the above bound for D = 1 and z^i ∈ [0,1]^m. The energy of the Gaussian-binary RBM p(v, h) is typically written as

E(v, h) = (1/(2σ^2)) v^T v − (1/σ^2) v^T b − (1/σ^2) v^T W h − c^T h,

with weight matrix W and bias vectors b and c. This can be brought into the form of formula (3) by setting k = 2, φ^1_j(v_j) = v_j, φ^2_j(v_j) = v_j^2, W^1 = W/σ^2, W^2 = 0, b^1_j = b_j/σ^2, and b^2_j = −1/(2σ^2). With b = 0 and thus b^1 = 0, it follows from Lemma 4 that a B-DBN p(v, h, ĥ) = p(v|h) p(h, ĥ) with properties (i) and (ii) from Theorem 5 exists. It remains to show that property (iii) holds. Since the conditional probability factorizes, it suffices to show that (iii) holds for every visible variable individually. The conditional probability of the j-th visible neuron of the constructed B-DBN is given by

p(v_j|h) = (1/√(2πσ^2)) exp( −(v_j − z_j(h))^2 / (2σ^2) ),

where the mean z_j(h) is the j-th element of W h. Using this, it is easy to see that

∫ p(v_j|h) |φ^2_j(v_j)| dv_j = ∫ p(v_j|h) v_j^2 dv_j < ∞,

because it is the second moment of the normal distribution.

For ∫ p(v_j|h) |φ^1_j(v_j)| dv_j we get

∫ p(v_j|h) |v_j| dv_j = (1/√(2πσ^2)) ∫ |v_j| exp( −(v_j − z_j(h))^2/(2σ^2) ) dv_j = (1/√(2πσ^2)) ∫ |t + z_j(h)| exp( −t^2/(2σ^2) ) dt ≤ z_j(h) + (1/√(2πσ^2)) ∫ |t| exp( −t^2/(2σ^2) ) dt = z_j(h) + σ √(2/π) ≤ n + σ √(2/π),   (10)

where we substituted t = v_j − z_j(h). In the last step we used that z_j(e_i) = z^i_j ∈ [0,1] by construction and thus z_j(h) can be bounded from above by

z_j(h) = Σ_{i=1}^n h_i z_j(e_i) ≤ n.   (11)

Thus it follows from Theorem 5 that the bound from Lemma 3 holds and is finite. To get the actual bound, we only need to find the constants ξ_1 and ξ_2 to be inserted into (9). The first constant is given by ξ_1 = max_{j,h,i} | z_j(h)/σ^2 − z^i_j/σ^2 |. It can be upper bounded by max_{j,h} z_j(h)/σ^2 ≤ n/σ^2, as an application of equation (11) shows. The second constant is given by ξ_2 = max_{j,h,i} | 1/(2σ^2) − 1/(2σ^2) | = 0. Inserting these constants into inequality (9) leads to the bound.

The bound B(G_σ, p_mix, ε) is also finite when Ω is restricted to a compact subset of R^m. This can easily be verified by adapting equation (10) accordingly. Similar results can be obtained for other families of distributions. A prominent example are B-DBNs with truncated exponential distributions. In this case the energy function of the first layer is the same as for the binary RBM, but the values of the visible neurons are chosen from the interval [0,1] instead of {0,1}. It is easy to see that for every choice of parameters the normalization constant as well as the bound are finite.

3.3. Infinite mixtures

We now transfer our results for finite mixtures to the case of infinite mixtures, following Li & Barron (2000).

Theorem 7. Let G be a family of continuous distributions and f ∈ CONV(G) such that the bound from Theorem 1 is finite for all p_mix-n ∈ MIX(n, G), n ∈ N. Furthermore, for all p_mix-n ∈ MIX(n, G), n ∈ N, and for all ε̂ > 0 let there exist a B-DBN in B-DBN(G) such that B(G, p_mix-n, ε̂) is finite. Then for all ε > 0 there exists p(v, h, ĥ) ∈ B-DBN(G) with KL(f ‖ p) ≤ ε.

Proof. From Theorem 1 and the assumption that the corresponding bound is finite it follows that for all ε > 0 there exists a mixture p_mix-n ∈ MIX(n, G) with n ≥ 2 c_f^2 γ / ε such that KL(f ‖ p_mix-n) ≤ ε/2. By assumption there exists a B-DBN ∈ B-DBN(G) such that B(G, p_mix-n, ε̂) is finite. Thus, one can define a sequence of B-DBNs (p_ε̂)_ε̂ ⊂ B-DBN(G) with ε̂ decaying to zero, where the B-DBNs only differ in the weights between the hidden layers, for which it holds that KL(p_ε̂ ‖ p_mix-n) → 0 as ε̂ → 0. This implies that p_ε̂ → p_mix-n uniformly as ε̂ → 0. It follows that KL(f ‖ p_ε̂) → KL(f ‖ p_mix-n) as ε̂ → 0. Thus, there exists ε̂ such that | KL(f ‖ p_ε̂) − KL(f ‖ p_mix-n) | < ε/2. A combination of these inequalities yields

KL(f ‖ p_ε̂) ≤ | KL(f ‖ p_ε̂) − KL(f ‖ p_mix-n) | + KL(f ‖ p_mix-n) ≤ ε.

This result applies to infinite mixtures of Gaussians with the same fixed but arbitrary variance σ^2 in all components. In the limit σ^2 → 0 such mixtures can approximate strictly positive densities over compact sets arbitrarily well (Zeevi & Meir, 1997).

4. Conclusions

We presented a step towards understanding the representational power of DBNs for modeling real-valued data. When binary latent variables are considered, DBNs with two hidden layers can already achieve good approximation results. Under mild constraints, we showed that for modeling a mixture of n pairwise independent distributions, a DBN with only 2n + 1 binary hidden units is sufficient to make the KL-divergence between the mixture p_mix and the DBN distribution p arbitrarily small (i.e., for every δ > 0 we can find a DBN such that KL(p ‖ p_mix) < δ).
This holds for deep architectures used in practice, for instance DBNs having visible neurons with Gaussian or truncated exponential conditional distributions, and corresponding mixture distributions having components of the same type as the visible units of the DBN. Furthermore, we extended these results to infinite mixtures and showed that these can be approximated arbitrarily well by a DBN with a finite number of neurons.

Therefore, Gaussian-binary DBNs inherit the universal approximation properties of additive Gaussian mixtures, which can model any strictly positive density over a compact domain with arbitrarily high accuracy.

Acknowledgments

OK and CI acknowledge support from the Danish National Advanced Technology Foundation through the project "Personalized breast cancer screening". AF and CI acknowledge support from the German Federal Ministry of Education and Research within the Bernstein Fokus "Learning behavioral models: From human experiment to technical assistance".

References

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Hinton, Geoffrey E. and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Le Roux, Nicolas and Bengio, Yoshua. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631-1649, 2008.

Le Roux, Nicolas and Bengio, Yoshua. Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192-2207, 2010.

Le Roux, Nicolas, Heess, Nicolas, Shotton, Jamie, and Winn, John M. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3), 2011.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 609-616. ACM, 2009.

Li, Jonathan Q. and Barron, Andrew R. Mixture density estimation. In Solla, Sara A., Leen, Todd K., and Müller, Klaus-Robert (eds.), Advances in Neural Information Processing Systems 12 (NIPS 1999). MIT Press, 2000.

Montufar, Guido and Ay, Nihat. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5), 2011.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), volume 2, 2007.

Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, David E. and McClelland, James L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194-281. MIT Press, 1986.

Taylor, Graham W., Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatio-temporal features. In Computer Vision - ECCV 2010, volume 6316 of LNCS. Springer, 2010.

Wang, Nan, Melchior, Jan, and Wiskott, Laurenz. An analysis of Gaussian-binary restricted Boltzmann machines for natural images. In Verleysen, Michel (ed.), Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012). Evere, Belgium: d-side publications, 2012.

Welling, Max, Rosen-Zvi, Michal, and Hinton, Geoffrey E. Exponential family harmoniums with an application to information retrieval. In Saul, Lawrence K., Weiss, Yair, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems 17 (NIPS 2004). MIT Press, 2005.

Zeevi, Assaf J. and Meir, Ronny. Density estimation through convex combinations of densities: Approximation and estimation bounds. Neural Networks, 10(1):99-109, 1997.
