A Bound for the Convergence Rate of Parallel Tempering for Sampling Restricted Boltzmann Machines
Asja Fischer (a, b), Christian Igel (b)
(a) Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
(b) Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark

Preprint submitted to Theoretical Computer Science, September 2015

Abstract
Sampling from restricted Boltzmann machines (RBMs) is done by Markov chain Monte Carlo (MCMC) methods. The faster the Markov chain converges, the more efficiently high-quality samples can be obtained. This is also important for robust training of RBMs, which usually relies on sampling. Parallel tempering (PT), an MCMC method that maintains several replicas of the original chain at higher temperatures, has been successfully applied for RBM training. We present the first analysis of the convergence rate of PT for sampling from binary RBMs. The resulting bound on the rate of convergence of the PT Markov chain shows an exponential dependency on the size of one layer and the absolute values of the RBM parameters. It is minimized by a uniform spacing of the inverse temperatures, which is often used in practice. As in the derivation of bounds on the approximation error for contrastive divergence learning, our bound on the mixing time implies an upper bound on the error of the gradient approximation when the method is used for RBM training.

Keywords: restricted Boltzmann machines, parallel tempering, mixing time

1. Introduction
Restricted Boltzmann machines (RBMs) are probabilistic graphical models corresponding to stochastic neural networks [1, 2] (see [3] for a recent review). They are applied in many machine learning tasks; notably, they serve as building blocks of deep belief networks [4]. Markov chain Monte
Carlo (MCMC) methods are used to sample from RBMs, and chains that quickly converge to their stationary distribution are desirable for obtaining high-quality samples efficiently. Adaptation of the RBM model parameters typically corresponds to gradient-based likelihood maximization given training data. As computing the exact gradient is usually computationally intractable, sampling-based methods are employed to approximate the likelihood gradient. It has been shown that inaccurate approximations can deteriorate the learning process (e.g., [5]), and for the most popular learning scheme, contrastive divergence learning (CD, []), the approximation quality has been analyzed [6, 7]. The quality of the approximation depends, among other things, on how quickly the Markov chain approaches the stationary distribution, that is, on its mixing rate. To improve RBM learning, parallel tempering (PT, [8]) has successfully been used as a sampling method in RBM training [9, 0,, 3]. Parallel tempering introduces supplementary Gibbs chains that sample from smoothed replicas of the original distribution with the goal of improving the mixing rate. However, so far there exist no published attempts to analyze the mixing rate of PT applied to RBMs. Based on the work by Woodard et al. [], we provide the first such analysis. After introducing the basic concepts, section 3 states our main result. Section 4 summarizes general theorems required for our proof in section 5, which is followed by a discussion and our conclusions.

2. Background
In the following, we give a brief introduction to RBMs and to the relation between the mixing rate and the spectral gap of a Markov chain. Afterwards we describe the parallel tempering algorithm and its application to sampling from RBMs.

2.1. Restricted Boltzmann machines
Restricted Boltzmann machines (RBMs) are probabilistic undirected graphical models (Markov random fields).
Their structure is a bipartite graph connecting a set of $m$ visible random variables $V = (V_1, V_2, \dots, V_m)$ modeling observations to $n$ hidden (latent) random variables $H = (H_1, H_2, \dots, H_n)$ capturing dependencies between the visible variables. In binary RBMs the state space of a single variable is $\{0, 1\}$ and accordingly $(V, H) \in \{0, 1\}^{m+n}$. The joint probability distribution of $(V, H)$ is given
by the Gibbs distribution $\pi(v, h) = \exp(-E(v, h))/Z$ with energy

$$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} h_i w_{ij} v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i \tag{1}$$

and real-valued connection weights $w_{ij}$ and bias parameters $b_j$ and $c_i$ for $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$. The normalization constant $Z$, also called the partition function, is given by $Z = \sum_{v,h} \exp(-E(v, h))$.

Training an RBM means adapting its parameters such that the distribution of $V$ models a distribution underlying some observed data. In practice, this training corresponds to performing stochastic gradient ascent on the log-likelihood of the weight and bias parameters given sample (training) data. The gradient of the log-likelihood given a single training sample $v_{\text{train}}$ is given by

$$\frac{\partial \ln \pi(v_{\text{train}})}{\partial \theta} = -\sum_{h} \pi(h \mid v_{\text{train}}) \frac{\partial E(v_{\text{train}}, h)}{\partial \theta} + \sum_{v,h} \pi(v, h) \frac{\partial E(v, h)}{\partial \theta}, \tag{2}$$

where $\theta$ is the vector collecting all parameters. Since the expectation under the model distribution in the second term on the right-hand side cannot be computed efficiently (the cost is exponential in $\min(n, m)$), it is approximated by MCMC methods in RBM training algorithms. Typically, the expected value under the model distribution is approximated by $\sum_h \pi(h \mid v^{(k)}) \, \partial E(v^{(k)}, h)/\partial \theta$ given a sample $v^{(k)}$ obtained by running a Markov chain for $k$ steps. Alternatively, we can consider $\sum_v \pi(v \mid h^{(k)}) \, \partial E(v, h^{(k)})/\partial \theta$ given a sample $h^{(k)}$ to save computation time if $m < n$.

2.2. Mixing rates and the spectral gap
A homogeneous Markov chain on a discrete state space $\Omega$ can be described by a transition matrix $P = (p_{x,y})_{x,y \in \Omega}$, where $p_{x,y}$ is the probability to move from $x$ to $y$ in one step of the Markov chain. We also refer to this probability as $P(x, y)$, and accordingly $P^k(x, y)$ gives the probability to move from $x$ to $y$ in $k$ steps of the chain.
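The energy (1) and the resulting Gibbs distribution can be made concrete for a tiny RBM by enumerating all $2^{m+n}$ states. The following is a minimal sketch, not part of the original paper; the function names `energy` and `gibbs_distribution`, the random parameters, and the layout of `W` as an $n \times m$ array (rows indexed by hidden units) are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -sum_ij h_i w_ij v_j - sum_j b_j v_j - sum_i c_i h_i, cf. (1).
    W is assumed to have shape (n, m): rows are hidden units."""
    return -(h @ W @ v) - b @ v - c @ h

def gibbs_distribution(W, b, c):
    """Enumerate all joint states; only feasible for very small m + n."""
    n, m = W.shape
    states = [(np.array(v), np.array(h))
              for v in itertools.product([0, 1], repeat=m)
              for h in itertools.product([0, 1], repeat=n)]
    weights = np.array([np.exp(-energy(v, h, W, b, c)) for v, h in states])
    Z = weights.sum()                 # partition function
    return states, weights / Z, Z

# Hypothetical toy parameters (m = 3 visible, n = 2 hidden units).
rng = np.random.default_rng(0)
m, n = 3, 2
W = rng.normal(size=(n, m))
b = rng.normal(size=m)
c = rng.normal(size=n)
states, probs, Z = gibbs_distribution(W, b, c)
```

The enumeration is exponential in $m + n$, which is exactly why the model expectation in (2) is replaced by an MCMC estimate in practice.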
Markov chain Monte Carlo methods make use of the fact that an ergodic Markov chain on $\Omega$ with transition matrix $P$ and equilibrium distribution $\pi$ satisfies $P^k(x, y) \to \pi(y)$ as $k \to \infty$ for all $x, y \in \Omega$. It is important to study the convergence rate of the Markov chain in order to find out how large $k$ has to be to ensure that $P^k(x, y)$ is suitably close to $\pi(y)$. One way to measure the closeness to stationarity is the total variation distance

$$\| P^k(x, \cdot) - \pi \| = \frac{1}{2} \sum_{y} | P^k(x, y) - \pi(y) |$$

for an arbitrary starting state $x$. Reversibility of the chain implies that the eigenvalues of $P$ are real-valued, and we sort them by value, $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_r > -1$. The value $\mathrm{Gap}(P) = 1 - \lambda_2$ is called the spectral gap of $P$, and the eigenvalue with the second largest absolute value, $\lambda_{\mathrm{SLEM}} = \max\{\lambda_2, |\lambda_r|\}$, is often referred to as the second largest eigenvalue modulus (SLEM). Many bounds on convergence rates are based on the SLEM. Diaconis and Saloff-Coste [3], for example, prove

$$\| P^k(x, \cdot) - \pi \| \le \frac{1}{2\sqrt{\pi(x)}} \, \lambda_{\mathrm{SLEM}}^k. \tag{3}$$

If $P$ is positive definite, all eigenvalues are non-negative. In this case $\lambda_{\mathrm{SLEM}} = \lambda_2$ and we can replace $\lambda_{\mathrm{SLEM}}$ in (3) (and similar bounds) by $1 - \mathrm{Gap}(P)$. We can make an arbitrary transition matrix $Q$ positive definite by skipping the move each time with probability $\frac{1}{2}$, that is, by considering the transition matrix $P = \frac{1}{2}(I + Q)$. This makes $P$ in the long run two times slower than $Q$.

A typical application of MCMC methods (e.g., in the training of RBMs) is to approximate the expected value of a given function under the stationary distribution based on samples obtained after running the Markov chain for $k$ steps. A bound on the chain's convergence rate directly leads to a bound on the bias of this approximation (see Appendix A for a proof).

2.3. Parallel tempering
Consider a multimodal target density $\pi$ on a state space $\Omega$ from which one would like to sample via a Markov chain. The transition steps of the Metropolis-Hastings algorithm move only locally in the space $\Omega$, so that the resulting Markov chain may move between the modes of $\pi$ only infrequently.
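The quantities of section 2.2 (spectral gap, SLEM, total variation distance, and the bound (3)) can be checked numerically on a small reversible chain. The sketch below is an illustration under stated assumptions: the four-state target `pi` and the Metropolis chain with uniform proposals are hypothetical examples, and the helper names are not from the paper.

```python
import numpy as np

def metropolis_matrix(pi):
    """Metropolis chain with uniform proposals; reversible w.r.t. pi."""
    r = len(pi)
    Q = np.zeros((r, r))
    for x in range(r):
        for y in range(r):
            if x != y:
                Q[x, y] = min(1.0, pi[y] / pi[x]) / (r - 1)
        Q[x, x] = 1.0 - Q[x].sum()
    return Q

def spectrum(P, pi):
    """Eigenvalues of a pi-reversible P in decreasing order.
    D^{1/2} P D^{-1/2} is symmetric for reversible P, so eigvalsh applies."""
    d = np.sqrt(pi)
    S = (d[:, None] * P) / d[None, :]
    return np.sort(np.linalg.eigvalsh((S + S.T) / 2))[::-1]

def tv_distance(P, pi, x, k):
    """Total variation distance between P^k(x, .) and pi."""
    return 0.5 * np.abs(np.linalg.matrix_power(P, k)[x] - pi).sum()

pi = np.array([0.5, 0.3, 0.15, 0.05])     # hypothetical target distribution
Q = metropolis_matrix(pi)
P = 0.5 * (np.eye(len(pi)) + Q)            # lazy chain: eigenvalues >= 0
lam = spectrum(P, pi)
gap = 1.0 - lam[1]                         # Gap(P) = 1 - lambda_2
slem = max(lam[1], abs(lam[-1]))           # equals lambda_2 for the lazy chain
```

For the lazy chain the SLEM coincides with $1 - \mathrm{Gap}(P)$, so the right-hand side of (3) can be evaluated from the gap alone, e.g. `slem**k / (2 * np.sqrt(pi[x]))` compared against `tv_distance(P, pi, x, k)`.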
The PT algorithm [8] tries to overcome this by introducing supplementary Markov chains whose stationary distributions are flattened or smoothed versions of $\pi$. These (and the original density) are referred to as tempered densities $\pi_t$, $t = 0, \dots, N$, fulfilling $\pi_t(z) \propto \pi(z)^{\beta_t}$ for $z \in \Omega$ with inverse temperatures $\beta_t$. The inverse
temperatures satisfy $0 \le \beta_0 < \beta_1 < \dots < \beta_N = 1$. Parallel tempering now constructs a Markov chain on the joint state space $\Omega_{\mathrm{PT}} = \Omega^{N+1}$ of the tempered distributions. The chain takes values $x = (x_{[0]}, \dots, x_{[N]}) \in \Omega_{\mathrm{PT}}$ and converges to the stationary distribution

$$\pi_{\mathrm{PT}}(x) = \prod_{t=0}^{N} \pi_t(x_{[t]}).$$

The PT transition matrix consists of two components, one for updating the states of the individual tempered chains and one allowing swaps between states of adjacent tempered chains. We denote the corresponding transition matrices by $T$ and $Q$, respectively. For the considerations in this paper, we add a probability of $\frac{1}{2}$ of staying in the current state each time $T$ or $Q$ is applied to guarantee positive definiteness and thus positive eigenvalues. We define the PT matrix as $P = QTQ$. A transition matrix constructed like this assures reversibility (unlike $TQ$) and is used in the analysis by Madras and Zheng [4] and Woodard et al. []. Here the matrices $T$ and $Q$ are defined as follows: $T$ chooses $t \in \{0, \dots, N\}$ uniformly and updates the state of chain $t$ based on some transition matrix $T_t$, which is reversible with respect to $\pi_t$. This can, for example, be the Metropolis-Hastings or a Gibbs matrix [5]. Thus, the probability of moving from $x = (x_{[0]}, \dots, x_{[N]}) \in \Omega_{\mathrm{PT}}$ to $y = (y_{[0]}, \dots, y_{[N]}) \in \Omega_{\mathrm{PT}}$ under $T$ is given by

$$T(x, y) = \frac{1}{2(N+1)} \sum_{t=0}^{N} T_t(x_{[t]}, y_{[t]}) \, I_{\{x_{[-t]}\}}(y_{[-t]}) + \frac{1}{2} I_{\{x\}}(y),$$

where $x_{[-t]} = (x_{[0]}, \dots, x_{[t-1]}, x_{[t+1]}, \dots, x_{[N]}) \in \Omega^N$ collects all components of $x$ except the $t$-th one and, for any set $B$, $I_B(y) = 1$ if $y \in B$ and $I_B(y) = 0$ otherwise. The term $\frac{1}{2} I_{\{x\}}(y)$ ensures non-negative eigenvalues as discussed above. The swapping matrix $Q$ proposes to swap the states $x_{[t]}$ and $x_{[t+1]}$ by sampling $t$ uniformly from $\{0, \dots, N-1\}$. The proposed swap is accepted with the Metropolis probability

$$a(x, t) = \min\left\{1, \frac{\pi_t(x_{[t+1]}) \, \pi_{t+1}(x_{[t]})}{\pi_t(x_{[t]}) \, \pi_{t+1}(x_{[t+1]})}\right\}, \tag{4}$$

which guarantees detailed balance. So the transition probability from $x = (x_{[0]}, \dots, x_{[N]})$ to $y = (x_{[0]}, \dots, x_{[t+1]}, x_{[t]}, \dots, x_{[N]})$ is given by

$$Q(x, y) = \frac{1}{2N} a(x, t),$$
the probability to stay in $x$ is

$$Q(x, x) = 1 - \sum_{t=0}^{N-1} \frac{1}{2N} a(x, t),$$

and the transition probability between all other states is 0. By construction both $T$ and $Q$ are reversible with respect to $\pi_{\mathrm{PT}}$ and strictly positive definite due to their holding probability. It follows from the definition of $P$ that both properties also hold for $P$. Thus, $P$ has only positive real-valued eigenvalues, and the convergence of the PT chain to the stationary distribution $\pi_{\mathrm{PT}}$ can be bounded in terms of (3) by replacing $\lambda_{\mathrm{SLEM}}$ by $1 - \eta$, where $\eta$ is a lower bound for $\mathrm{Gap}(P)$.

2.4. Training RBMs with PT
The PT algorithm is often the sampling procedure of choice for RBM learning algorithms to approximate the gradient of the log-likelihood given in equation (2) [0, ]. In this setting, the tempered distributions are defined as

$$\pi_t(v, h) = \frac{\exp(-E(v, h)\beta_t)}{Z_t} \tag{5}$$

with $Z_t = \sum_{v,h} \exp(-E(v, h)\beta_t)$, and $\beta_0$ is typically set to 0, which we will also assume in our analysis. The transition matrices $T_t$ for the single tempered chains are set to a randomized version of blockwise Gibbs sampling: we choose to update either $V$ or $H$ with equal probability and then update all neurons in this layer simultaneously. The random choice of the layer leads to reversibility of $T_t$. In this paper, we want to find a lower bound for the spectral gap of PT sampling as defined above applied to RBMs.

In the PT algorithm most often used for RBM training in practice, the transition operator differs from the one analyzed in this paper. It consists of an update step followed by a swapping step. During the update step, one step of blockwise Gibbs sampling is performed in all tempered chains in parallel, where one step of blockwise Gibbs sampling corresponds to sampling a new state $h'$ of $H$ based on the conditional probability $\pi(h' \mid v)$ given the current state of the visible variables $v$, followed by sampling a new state $v'$ of $V$ based on $\pi(v' \mid h')$.
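The tempered distributions (5), the blockwise Gibbs update, and the Metropolis swap probability (4) translate fairly directly into code. The following is a hedged sketch of the practical PT variant (a Gibbs update in every chain followed by swaps of neighbouring chains), not of the operator $P = QTQ$ analyzed in the paper. The function names, the toy parameters, the number of sweeps, and the particular pairing of neighbouring chains in two passes are assumptions for illustration. For RBMs, (4) reduces to $\min\{1, \exp((\beta_{t+1} - \beta_t)(E(x_{[t+1]}) - E(x_{[t]})))\}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    return -(h @ W @ v) - b @ v - c @ h

def gibbs_step(v, beta, W, b, c, rng):
    """One blockwise Gibbs step at inverse temperature beta:
    sample h' from pi_t(h | v), then v' from pi_t(v | h')."""
    h = (rng.random(W.shape[0]) < sigmoid(beta * (c + W @ v))).astype(float)
    v = (rng.random(W.shape[1]) < sigmoid(beta * (b + h @ W))).astype(float)
    return v, h

def swap_prob(E_t, E_next, beta_t, beta_next):
    """Metropolis acceptance (4) written in terms of energies."""
    return float(np.exp(min(0.0, (beta_next - beta_t) * (E_next - E_t))))

def pt_step(V, H, betas, W, b, c, rng):
    """Update all chains, then try swaps (pairs starting at even t, then odd t)."""
    for t in range(len(betas)):
        V[t], H[t] = gibbs_step(V[t], betas[t], W, b, c, rng)
    for parity in (0, 1):
        for t in range(parity, len(betas) - 1, 2):
            a = swap_prob(energy(V[t], H[t], W, b, c),
                          energy(V[t + 1], H[t + 1], W, b, c),
                          betas[t], betas[t + 1])
            if rng.random() < a:
                V[[t, t + 1]] = V[[t + 1, t]]
                H[[t, t + 1]] = H[[t + 1, t]]
    return V, H

# Hypothetical toy setup: n = 3 hidden, m = 4 visible, N + 1 = 5 chains.
rng = np.random.default_rng(0)
n, m, N = 3, 4, 4
W = rng.normal(size=(n, m)); b = rng.normal(size=m); c = rng.normal(size=n)
betas = np.linspace(0.0, 1.0, N + 1)       # uniform spacing, beta_0 = 0
V = rng.integers(0, 2, size=(N + 1, m)).astype(float)
H = np.zeros((N + 1, n))
for _ in range(100):
    V, H = pt_step(V, H, betas, W, b, c, rng)
sample_v = V[-1]                            # state of the chain with beta_N = 1
```

Only the last row, belonging to $\beta_N = 1$, is a sample from the RBM itself; the other rows are auxiliary.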
During the swapping step, the states at all temperatures may be swapped based on the Metropolis probability $a(x, t)$ as defined in (4), with $x_{[t]} = (v_{[t]}, h_{[t]})$. The swapping is often organized in two substeps, where even temperatures
are considered in the first and odd temperatures in the second substep. While this approach seems to work in practice, the corresponding transition matrix need not be reversible, which makes it difficult to obtain theoretical results on its mixing behavior; see our discussion.

3. Main Result
Let $\pi$ denote the Gibbs distribution of an RBM as defined in (1), and for $0 = \beta_0 < \beta_1 < \dots < \beta_N = 1$ let $\pi_t$ be the corresponding tempered Gibbs distribution with inverse temperature $\beta_t$ as given in (5). Then the stationary distribution of PT sampling for RBMs as defined above is given by $\pi_{\mathrm{PT}}((x_{[0]}, \dots, x_{[N]})) = \prod_{t=0}^{N} \pi_t(x_{[t]})$, and for the spectral gap of the corresponding transition matrix $P$ it holds:

Theorem 1. For the PT transition matrix $P$ of an RBM with $m$ visible and $n$ hidden variables

$$\mathrm{Gap}(P) \ge \min\left\{ \frac{\exp(-2c)}{2^{n+10}(N+1)^4} \prod_{j=1}^{m} f_j^2, \; \frac{\exp(-2c - \max_t(\beta_{t+1} - \beta_t)\Delta)}{2^{8}(N+1)^4} \prod_{j=1}^{m} f_j^2, \; \frac{\exp(-2b)}{2^{m+10}(N+1)^4} \prod_{i=1}^{n} g_i^2, \; \frac{\exp(-2b - \max_t(\beta_{t+1} - \beta_t)\Delta)}{2^{8}(N+1)^4} \prod_{i=1}^{n} g_i^2 \right\} \tag{6}$$

with $\Delta = \sum_{i,j} |w_{ij}| + \sum_j |b_j| + \sum_i |c_i|$ being the sum over the absolute values of the parameters of the RBM, $b = \sum_j |b_j|$ and $c = \sum_i |c_i|$ being the sums over the absolute values of the bias parameters of the visible and hidden variables, respectively, and

$$f_j = \frac{\min_h \pi(v_j = 0 \mid h)}{\max_h \pi(v_j = 0 \mid h)} \quad \text{and} \quad g_i = \frac{\min_v \pi(h_i = 0 \mid v)}{\max_v \pi(h_i = 0 \mid v)}$$

being the ratios of the minimal to the maximal conditional probability that the $j$-th visible and the $i$-th hidden neuron, respectively, is inactive. In closed form, $f_j = \big(1 + \exp(b_j + \sum_i \min\{w_{ij}, 0\})\big) / \big(1 + \exp(b_j + \sum_i \max\{w_{ij}, 0\})\big)$, and analogously for $g_i$ with $c_i$ and sums over $j$.
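To get a feeling for the size of the bound in Theorem 1, it can be evaluated in log-space for a given set of parameters. This is a sketch under assumptions: it uses the four-term minimum of (6) with constants $2^{n+10}$, $2^{m+10}$, $2^8$ and the closed-form expressions for $f_j$ and $g_i$ that step 5 of the proof arrives at; the function name `log_gap_bound` and the toy parameters are hypothetical, and `W` is again assumed to be an $n \times m$ array.

```python
import numpy as np

def log_gap_bound(W, b, c, betas):
    """Natural log of the right-hand side of (6) for an RBM with
    weights W (n x m), visible biases b, hidden biases c."""
    n, m = W.shape
    N = len(betas) - 1
    delta = np.abs(W).sum() + np.abs(b).sum() + np.abs(c).sum()
    max_step = np.max(np.diff(betas))          # max_t (beta_{t+1} - beta_t)
    neg, pos = np.minimum(W, 0.0), np.maximum(W, 0.0)
    # log f_j and log g_i via logaddexp(0, x) = log(1 + exp(x)).
    log_f = np.logaddexp(0, b + neg.sum(axis=0)) - np.logaddexp(0, b + pos.sum(axis=0))
    log_g = np.logaddexp(0, c + neg.sum(axis=1)) - np.logaddexp(0, c + pos.sum(axis=1))
    ln2, base = np.log(2.0), -4.0 * np.log(N + 1)
    cs, bs = np.abs(c).sum(), np.abs(b).sum()
    terms = [
        -2 * cs - (n + 10) * ln2 + base + 2 * log_f.sum(),
        -2 * cs - max_step * delta - 8 * ln2 + base + 2 * log_f.sum(),
        -2 * bs - (m + 10) * ln2 + base + 2 * log_g.sum(),
        -2 * bs - max_step * delta - 8 * ln2 + base + 2 * log_g.sum(),
    ]
    return min(terms)

# Hypothetical toy RBM: uniform versus strongly non-uniform temperature spacing.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3)); b = rng.normal(size=3); c = rng.normal(size=2)
uniform = log_gap_bound(W, b, c, np.linspace(0.0, 1.0, 5))
skewed = log_gap_bound(W, b, c, np.array([0.0, 0.05, 0.1, 0.15, 1.0]))
```

Because only $\max_t(\beta_{t+1} - \beta_t)$ enters, the bound for a fixed number of chains is never smaller for uniform spacing than for any other spacing, which is the observation made in the abstract.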
The proof is given in section 5. By combining this result with equation (3), we arrive at an upper bound on the rate of convergence of the PT chain.

Corollary 1. Let $P$ be the PT transition matrix for an RBM with $m$ visible and $n$ hidden variables and let $\pi_{\mathrm{PT}}$ be the joint distribution over all tempered chains. Then for any starting point of the Markov chain $x = (x_{[0]}, \dots, x_{[N]})$, with $x_{[i]} = (v_{[i]}, h_{[i]})$, the distance in variation to $\pi_{\mathrm{PT}}$ is bounded by

$$\| P^k(x, \cdot) - \pi_{\mathrm{PT}} \| \le \frac{1}{2\sqrt{\pi_{\mathrm{PT}}(x)}} (1 - \eta)^k \tag{7}$$

with

$$\eta = \min\left\{ \frac{\exp(-2c)}{2^{n+10}(N+1)^4} \prod_{j=1}^{m} f_j^2, \; \frac{\exp(-2c - \max_t(\beta_{t+1} - \beta_t)\Delta)}{2^{8}(N+1)^4} \prod_{j=1}^{m} f_j^2, \; \frac{\exp(-2b)}{2^{m+10}(N+1)^4} \prod_{i=1}^{n} g_i^2, \; \frac{\exp(-2b - \max_t(\beta_{t+1} - \beta_t)\Delta)}{2^{8}(N+1)^4} \prod_{i=1}^{n} g_i^2 \right\}$$

and $\Delta$, $b$, $c$, $f_j$, and $g_i$ as defined above.

Theorem 1 bounds the gap of the PT product chain. As the original chain (the chain with inverse temperature $\beta_N$) never converges more slowly than this product chain (see Appendix B for a proof), corollary 1 also leads to an upper bound for the convergence rate of the chain of interest. Bounding the product chain leads to the dependency on the number $N$ of supplementary tempered chains. However, to our knowledge all approaches analyzing PT suffer from this drawback (e.g., [, 4, 6]).

Now, we can make use of the fact that a bound on the convergence rate implies a bound on the bias of an MCMC estimate and obtain a bound on the bias of the PT-based estimator when used in RBM training. Recall that for training RBMs the log-likelihood gradient given in equation (2) is approximated by replacing the expected value under the model distribution in the second term, $\sum_{v,h} \pi(v, h) \, \partial E(v, h)/\partial \theta$, by $\sum_h \pi(h \mid v^{(k)}) \, \partial E(v^{(k)}, h)/\partial \theta$ for a sample $v^{(k)}$ obtained by running the PT chain for $k$ steps. The absolute value of $\sum_h \pi(h \mid v^{(k)}) \, \partial E(v^{(k)}, h)/\partial \theta$ is bounded by 1 and thus, using the result in Appendix A with $\max t(x) = 1$, we get the following corollary:
Corollary 2. Given an RBM with $m$ visible and $n$ hidden variables and a PT chain starting from $x = (x_{[0]}, \dots, x_{[N]})$, with $x_{[i]} = (v_{[i]}, h_{[i]})$. Let $\mathrm{PT}(x^{(k)})$ be the approximation of the log-likelihood derivative w.r.t. some RBM parameter $\theta$ obtained by estimating the second term in equation (2) based on a PT sample $x^{(k)} = (x^{(k)}_{[0]}, \dots, x^{(k)}_{[N]})$, with $x^{(k)}_{[i]} = (v^{(k)}_{[i]}, h^{(k)}_{[i]})$, gained by running the chain for $k$ steps. That is, for a training sample $v_{\text{train}}$,

$$\mathrm{PT}(x^{(k)}) = -\sum_h \pi(h \mid v_{\text{train}}) \frac{\partial E(v_{\text{train}}, h)}{\partial \theta} + \sum_h \pi(h \mid v^{(k)}_{[N]}) \frac{\partial E(v^{(k)}_{[N]}, h)}{\partial \theta}. \tag{8}$$

Then we can bound the absolute value of the bias of $\mathrm{PT}(x^{(k)})$ by

$$\left| \frac{\partial \ln \pi(v_{\text{train}})}{\partial \theta} - \sum_{x^{(k)}} P^k(x, x^{(k)}) \, \mathrm{PT}(x^{(k)}) \right| \le \frac{1}{\sqrt{\pi_{\mathrm{PT}}(x)}} (1 - \eta)^k, \tag{9}$$

where $\eta$ is defined as in corollary 1.

Similar considerations lead to an alternative proof of the approximation error of contrastive divergence learning [7] by combining the result in Appendix A with standard bounds on the periodic Gibbs sampler (e.g. [0], page 89).

4. Bounding the Spectral Gap of PT
This section summarizes definitions and results required for the proof of theorem 1. We state important results for bounding the spectral gap of general Markov chains and recall the link between the SLEM and Dobrushin's coefficient.

4.1. Definitions
For any transition matrix $P$ reversible with respect to a distribution $\pi$ and any subset $A$ of the state space $\Omega$ of $P$, the restriction of $P$ to $A$ is defined as

$$P|_A(x, B) = P(x, B) + I_B(x) \, P(x, A^c) \tag{10}$$

for $x \in A$, $B \subset A$. In this way transitions that would leave $A$ are prohibited and the probabilities to stay in $A$ are increased correspondingly.
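The restriction (10) has a simple matrix form: keep the block of $P$ indexed by $A$ and add the probability mass that would leave $A$ to the diagonal. The sketch below illustrates this on a hypothetical four-state Metropolis chain; the helper name `restrict` and the example values are assumptions for illustration.

```python
import numpy as np

def restrict(P, A):
    """Restriction P|_A of (10): moves leaving A are turned into holding moves."""
    A = np.asarray(A)
    R = P[np.ix_(A, A)].copy()
    idx = np.arange(len(A))
    R[idx, idx] += 1.0 - R.sum(axis=1)     # add P(x, A^c) to the diagonal
    return R

# Hypothetical pi-reversible chain: Metropolis with uniform proposals.
pi = np.array([0.4, 0.3, 0.2, 0.1])
P = np.array([[min(1.0, pi[y] / pi[x]) / 3 if x != y else 0.0
               for y in range(4)] for x in range(4)])
P[np.diag_indices(4)] = 1.0 - P.sum(axis=1)

A = [0, 2, 3]
R = restrict(P, A)
pi_A = pi[A] / pi[A].sum()                 # pi conditioned on A
flow = pi_A[:, None] * R                   # symmetric iff R is pi_A-reversible
```

Since off-diagonal entries inside $A$ are unchanged, the restricted chain inherits reversibility, now with respect to $\pi$ conditioned on $A$.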
Given a partition $\mathcal{A} = \{A_j : j = 1, \dots, J\}$ of a finite $\Omega$ such that $\pi(A_j) = \sum_{x \in A_j} \pi(x) > 0$ for all $j$, the projection matrix of $P$ with respect to $\mathcal{A}$ is given by the transition operator $\bar{P}$ with

$$\bar{P}(i, j) = \frac{1}{\pi(A_i)} \sum_{x \in A_i} \sum_{y \in A_j} \pi(x) P(x, y) \tag{11}$$

for $i, j \in \{1, \dots, J\}$.

4.2. General results
Given a partition $\mathcal{A}$ of the state space as defined above, the following theorem allows us to lower bound the spectral gap of the PT transition matrix in terms of the spectral gap of the projection matrix and the minimum over the spectral gaps of the restrictions of $P$ to the subsets $A_j$.

Theorem 2. Let $P$ be a transition matrix reversible with respect to a distribution $\pi$ on a state space $\Omega$. Let $\{A_j : j = 1, \dots, J\}$ be any partition of $\Omega$ such that $\pi(A_j) > 0$ for all $j$. Define $P|_{A_j}$ as in (10) and $\bar{P}$ as in (11). If $P$ is nonnegative definite, it holds

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar{P}) \min_{j=1,\dots,J} \mathrm{Gap}(P|_{A_j}).$$

As Woodard et al. [] show, this theorem can be proven based on the results from Caracciolo et al. [7] (which were first published in [8]) and the results from Madras and Zheng [4].

Often, one wishes to bound the spectral gap of one chain based on the spectral gap of another chain, which, for example, is easier to estimate. In this case the following theorem can be applied. Combining the theorem for the comparison of the Dirichlet forms of two Markov chains given by Diaconis and Saloff-Coste [9] with Rayleigh's theorem (e.g., [0], p. 05) we get:

Theorem 3. Let $P$ and $Q$ be transition matrices on a finite state space $\Omega$, reversible with respect to densities $\pi_P$ and $\pi_Q$, respectively. Denote by $E_P = \{(x, y) : \pi_P(x) P(x, y) > 0\}$ and $E_Q = \{(x, y) : \pi_Q(x) Q(x, y) > 0\}$ the edge sets of the corresponding transition graphs. For each pair $x \ne y$
such that $(x, y) \in E_Q$ fix a path $\gamma_{x,y} = (x = x_0, x_1, \dots, x_k = y)$ of length $|\gamma_{x,y}| = k$ such that $(x_i, x_{i+1}) \in E_P$ for $i \in \{0, \dots, k-1\}$ and define

$$c = \max_{(z,w) \in E_P} \left\{ \frac{1}{\pi_P(z) P(z, w)} \sum_{x,y : (z,w) \in \gamma_{x,y}} |\gamma_{x,y}| \, \pi_Q(x) Q(x, y) \right\}.$$

Then it holds $\mathrm{Gap}(Q) \le c \, \mathrm{Gap}(P)$.

If both chains are reversible with respect to the same distribution, the following theorem proven in [] can be applied:

Theorem 4. Let $P$ and $Q$ be transition matrices on a state space $\Omega$, both reversible with respect to the same distribution $\pi$. If $Q(x, A \setminus \{x\}) \le P(x, A \setminus \{x\})$ for every $x \in \Omega$ and every $A \subset \Omega$, then $\mathrm{Gap}(Q) \le \mathrm{Gap}(P)$.

Recall that the PT transition matrix $T$ combines $N + 1$ transition operators $T_0, \dots, T_N$ by taking values $x = (x_{[0]}, \dots, x_{[N]})$ in the joint state space $\Omega_{\mathrm{PT}}$ and performing a transition by randomly picking $t \in \{0, \dots, N\}$ and sampling a new value $x_{[t]}$ for the $t$-th chain from $T_t$. Such a Markov chain is called a product chain. The following theorem deals with product chains and gives the dependency between the spectral gap of the product chain and the spectral gaps of all single chains it subsumes ([], lemma 3.):

Theorem 5. For any natural number $N$ and $t = 0, \dots, N$, let $P_t$ be a $\pi_t$-reversible transition matrix on a state space $\Omega_t$. Let $P$ be the transition matrix on $\Omega = \prod_t \Omega_t$ given by

$$P(x, y) = \sum_{t=0}^{N} b_t \, P_t(x_{[t]}, y_{[t]}) \, I_{\{x_{[-t]}\}}(y_{[-t]}), \quad x, y \in \Omega,$$

for some set of $b_t > 0$ such that $\sum_t b_t = 1$ and where $x_{[t]}$ denotes the $t$-th component of $x$ (corresponding to the state of the $t$-th chain) and $x_{[-t]}$ all components except the $t$-th one. A Markov chain with transition matrix $P$ is called a product chain. It is reversible with respect to $\pi(x) = \prod_t \pi_t(x_{[t]})$ and

$$\mathrm{Gap}(P) = \min_{t=0,\dots,N} b_t \, \mathrm{Gap}(P_t).$$

The next theorem is a well-known result that gives an upper bound on the SLEM (e.g., see [0], p. 37).
Theorem 6. The second largest eigenvalue modulus $\lambda_{\mathrm{SLEM}}$ of a transition matrix $P = (p_{x,y})_{x,y \in \Omega}$ can be bounded from above by Dobrushin's coefficient:

$$D(P) = \frac{1}{2} \max_{x,y} \sum_{z} |p_{x,z} - p_{y,z}| = 1 - \min_{x,y} \sum_{z} \min\{p_{x,z}, p_{y,z}\}.$$

5. Proof of the main result
For proving the main result we make use of an approach inspired by Woodard et al. []. The basic idea behind this approach is to partition the state space of the PT chain into subsets such that the Markov chain mixes well inside the single subsets at all temperatures and between the subsets at the highest temperature. To begin with, a suitable partitioning of the RBM state space is needed. For a binary RBM with $m$ visible and $n$ hidden neurons the state space is $\Omega = \{0, 1\}^{n+m}$. Let $\mathcal{A} = \{A_j : j = 1, \dots, 2^n\}$ be the partition of $\Omega$ where each subset $A_j$ contains all states having the same state of the hidden neurons, that is, $A_j = \{(v, h) \mid h = h_j\}$ if $h_j$ denotes the $j$-th state of the hidden random variables. Thus, we get $2^n$ subsets $A_1, \dots, A_{2^n}$ and each $A_j$ contains $2^m$ elements.

Using this partition, we can now prove theorem 1 based on the results given in the previous section. Our proof can be divided into five steps, where steps 1 and 2 are taken from Woodard et al. []. The basic thoughts behind the steps of the proof can be outlined as follows. The gap of the PT transition matrix depends on (a) how well the single tempered chains mix inside the single subsets $A_1, \dots, A_{2^n}$, (b) how well the chain at the highest temperature (i.e., the chain with inverse temperature $\beta_0$) mixes between the subsets, and (c) the mixing properties between the chains at the different temperatures. In step 1, $\mathrm{Gap}(P)$ is bounded in terms of the gaps of two transition matrices, one depending on (a) and the other depending on (b) and (c). The first one is further bounded in steps 2 and 3 and the second one in step 4. Step 5 puts everything together.

5.1. Step 1: Bounding $\mathrm{Gap}(P)$ in terms of $\mathrm{Gap}(\bar{P})$ and $\mathrm{Gap}(P|_{\Omega_\sigma})$
Consider the state $x = (x_{[0]}, \dots, x_{[N]}) \in \Omega^{N+1}$ of the PT chain.
Let the signature be the vector $s(x) = (\sigma_0, \dots, \sigma_N)$ with $\sigma_t = j$ if $x_{[t]} \in A_j$ for $t = 0, \dots, N$. Since the partition of $\Omega$ consists of $2^n$ subsets $A_j$ and we have $N + 1$ temperatures, a signature lives in $\Sigma = \{1, \dots, 2^n\}^{N+1}$.
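For a fixed $\sigma \in \Sigma$ let us now define $\Omega_\sigma = \{x \in \Omega^{N+1} : s(x) = \sigma\}$. Then all possible $\sigma$ induce a partition $\{\Omega_\sigma\}_{\sigma \in \Sigma}$ of the PT state space $\Omega_{\mathrm{PT}} = \Omega^{N+1}$. Let $P|_{\Omega_\sigma}$ now be the restriction of $P$ to $\Omega_\sigma$ as defined in equation (10), and let $\bar{P}$ denote the projection matrix of $P$ with respect to $\{\Omega_\sigma\}_{\sigma \in \Sigma}$ as defined in (11). Now we can apply theorem 2 and get

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar{P}) \min_{\sigma \in \Sigma} \mathrm{Gap}(P|_{\Omega_\sigma}). \tag{12}$$

5.2. Step 2: Bounding $\mathrm{Gap}(P|_{\Omega_\sigma})$ in terms of $\min_{t,j} \mathrm{Gap}(T_t|_{A_j})$
Woodard et al. now proceed by bounding $\mathrm{Gap}(\bar{P})$ and $\mathrm{Gap}(P|_{\Omega_\sigma})$ separately. For bounding $\mathrm{Gap}(P|_{\Omega_\sigma})$ they note that, since $P = QTQ$ and $Q$ has a holding probability of $\frac{1}{2}$,

$$P|_{\Omega_\sigma}(x, y) \ge \frac{1}{4} T|_{\Omega_\sigma}(x, y) \quad \forall x, y \in \Omega_\sigma.$$

With theorem 4 it follows that $\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{4} \mathrm{Gap}(T|_{\Omega_\sigma})$. As $T|_{\Omega_\sigma}$ is a product chain, theorem 5 gives

$$\mathrm{Gap}(T|_{\Omega_\sigma}) = \frac{1}{2(N+1)} \min_t \mathrm{Gap}(T_t|_{A_{\sigma_t}}) \ge \frac{1}{2(N+1)} \min_{t,j} \mathrm{Gap}(T_t|_{A_j}).$$

Thus, it follows

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{8(N+1)} \min_{t,j} \mathrm{Gap}(T_t|_{A_j}). \tag{13}$$

5.3. Step 3: Bounding $\min_{t,j} \mathrm{Gap}(T_t|_{A_j})$
We will now derive a lower bound for $\min_{t,j} \mathrm{Gap}(T_t|_{A_j})$. The transition matrix $T_t$ analyzed here for (tempered) RBMs corresponds to randomized blockwise Gibbs sampling as described above. The transition probability from $(v, h)$ to $(v', h)$ with $v' \ne v$ is given by $T_t((v, h), (v', h)) = \frac{1}{2} \pi_t(v' \mid h)$, and the probability to change the state of the hidden variables to $h' \ne h$ is accordingly $T_t((v, h), (v, h')) = \frac{1}{2} \pi_t(h' \mid v)$. Based on (10), for the restriction of $T_t$ to $A_j$ it holds for $(v, h_j), (v', h_j) \in A_j$:

$$T_t|_{A_j}((v, h_j), (v', h_j)) = T_t((v, h_j), (v', h_j)) + I_{\{(v, h_j)\}}((v', h_j)) \, T_t((v, h_j), A_j^c).$$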
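The restricted chain $T_t|_{A_j}$ described above is small enough to build explicitly for a toy RBM, which allows its spectral gap and Dobrushin's coefficient (theorem 6) to be computed numerically. The sketch assumes that, inside $A_j$, the chain moves to $v'$ with probability $\frac{1}{2}\pi_t(v' \mid h_j)$ and holds otherwise; the helper names and the random parameters are hypothetical.

```python
import itertools
import numpy as np

def visible_conditional(h, beta, W, b):
    """pi_t(v | h) over all 2^m visible states at inverse temperature beta."""
    m = W.shape[1]
    vs = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)
    logits = beta * (vs @ (b + h @ W))   # -beta * E(v, h) up to terms constant in v
    p = np.exp(logits - logits.max())
    return p / p.sum()

def restricted_gibbs(h, beta, W, b):
    """T_t restricted to A_j = {(v, h_j)}: every row is
    0.5 * pi_t(. | h_j), plus holding probability 0.5 on the diagonal."""
    p = visible_conditional(h, beta, W, b)
    k = len(p)
    return 0.5 * np.tile(p, (k, 1)) + 0.5 * np.eye(k)

def dobrushin(P):
    """D(P) = 1 - min_{x,y} sum_z min(p_xz, p_yz), cf. theorem 6."""
    k = P.shape[0]
    return 1.0 - min(np.minimum(P[x], P[y]).sum()
                     for x in range(k) for y in range(k))

# Hypothetical toy RBM with n = 2 hidden and m = 3 visible units.
rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3)); b = rng.normal(size=3)
R = restricted_gibbs(np.array([1.0, 0.0]), 0.7, W, b)
eigs = np.sort(np.linalg.eigvals(R).real)[::-1]
gap = 1.0 - eigs[1]
D = dobrushin(R)
```

Because every row of the restricted matrix is the same distribution plus a holding term, both `gap` and `D` come out as one half regardless of the parameters or the temperature.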
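For $(v, h_j), (v', h_j) \in A_j$ with $v \ne v'$ this gives

$$T_t|_{A_j}((v, h_j), (v', h_j)) = \frac{1}{2} \pi_t(v' \mid h_j),$$

and the probability to stay in the same state is

$$T_t|_{A_j}((v, h_j), (v, h_j)) = \frac{1}{2} \pi_t(v \mid h_j) + \frac{1}{2} \sum_h \pi_t(h \mid v) = \frac{1}{2} \pi_t(v \mid h_j) + \frac{1}{2}.$$

Let us denote the SLEM of $T_t|_{A_j}$ by $\lambda^{T_t}_{\mathrm{SLEM}}$ and the second largest eigenvalue by $\lambda^{T_t}_2$. Based on theorem 6 it holds

$$\mathrm{Gap}(T_t|_{A_j}) = 1 - \lambda^{T_t}_2 \ge 1 - \lambda^{T_t}_{\mathrm{SLEM}} \ge 1 - D(T_t|_{A_j}). \tag{14}$$

To upper bound Dobrushin's coefficient of $T_t|_{A_j}$, first note that for all $v \ne v'$:

$$\sum_{\hat{v}} \min\{T_t|_{A_j}((v, h_j), (\hat{v}, h_j)), \, T_t|_{A_j}((v', h_j), (\hat{v}, h_j))\}$$
$$= \sum_{\hat{v} : \hat{v} \ne v \wedge \hat{v} \ne v'} \frac{1}{2} \pi_t(\hat{v} \mid h_j) + \min\left\{\frac{1}{2} \pi_t(v \mid h_j), \frac{1}{2} \pi_t(v \mid h_j) + \frac{1}{2}\right\} + \min\left\{\frac{1}{2} \pi_t(v' \mid h_j), \frac{1}{2} \pi_t(v' \mid h_j) + \frac{1}{2}\right\}$$
$$= \frac{1}{2} \sum_{\hat{v}} \pi_t(\hat{v} \mid h_j) = \frac{1}{2}.$$

Now it is easy to see that $D(T_t|_{A_j}) = 1 - \frac{1}{2} = \frac{1}{2}$, and insertion into (14) gives $\mathrm{Gap}(T_t|_{A_j}) \ge 1 - D(T_t|_{A_j}) = \frac{1}{2}$. Thus, we finally get by insertion into equation (13)

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{16(N+1)}. \tag{15}$$

5.4. Step 4: Bounding $\mathrm{Gap}(\bar{P})$
For bounding $\mathrm{Gap}(\bar{P})$ first note that $\bar{P}$ is reversible with respect to the probability mass function

$$\bar{\pi}(\sigma) = \pi_{\mathrm{PT}}(\Omega_\sigma) = \prod_{t=0}^{N} \pi_t(A_{\sigma_t}), \quad \forall \sigma \in \Sigma,$$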
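and that according to definition (11) for any $\sigma, \tau \in \Sigma$, the probability of moving from $\sigma$ to $\tau$ under $\bar{P}$ is given by

$$\bar{P}(\sigma, \tau) = \frac{1}{\pi_{\mathrm{PT}}(\Omega_\sigma)} \sum_{x \in \Omega_\sigma} \sum_{y \in \Omega_\tau} \pi_{\mathrm{PT}}(x) P(x, y). \tag{16}$$

We proceed by bounding $\mathrm{Gap}(\bar{P})$ based on theorem 3 by comparing $\bar{P}$ to another $\bar{\pi}$-reversible transition matrix $\bar{T}$. The transition matrix $\bar{T}$ chooses $t$ uniformly from $\{0, \dots, N\}$ and then draws $\sigma_t$ with probability $\pi_t(A_{\sigma_t})$. To ease the notation let us now denote $\sigma_{[i,j]} = (\sigma_0, \dots, \sigma_{i-1}, j, \sigma_{i+1}, \dots, \sigma_N)$. Now we can define the transition matrix $\bar{T}$ by

$$\bar{T}(\sigma, \sigma_{[i,j]}) = \frac{1}{N+1} \pi_i(A_j)$$

and $\bar{T}(\sigma, \sigma') = 0$ if $\forall i, j : \sigma' \ne \sigma_{[i,j]}$. For the application of theorem 3, for each edge $(\sigma, \sigma_{[i,j]})$ in the transition graph of $\bar{T}$ let us define a path $\gamma_{\sigma, \sigma_{[i,j]}}$ in the transition graph of $\bar{P}$ as follows:

Stage 1. change $\sigma_0$ to $j$;
Stage 2. swap $j$ up to level $i$;
Stage 3. swap new $\sigma_{i-1}$ (formerly $\sigma_i$) down to level 0;
Stage 4. change value at level 0 to $\sigma_0$ (from former $\sigma_i$).

We derive an upper bound of the constant $c$ of theorem 3 by splitting it into three terms and bounding each term separately. Here $c$ is the maximum with respect to $\tau$ and $\xi$ (with $\bar{\pi}(\tau) \bar{P}(\tau, \xi) > 0$) of

$$\underbrace{\sum_{\sigma, \sigma_{[i,j]} : (\tau, \xi) \in \gamma_{\sigma, \sigma_{[i,j]}}} |\gamma_{\sigma, \sigma_{[i,j]}}|}_{\text{Term I}} \; \underbrace{\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)}}_{\text{Term II}} \; \underbrace{\frac{\bar{T}(\sigma, \sigma_{[i,j]})}{\bar{P}(\tau, \xi)}}_{\text{Term III}}.$$

The product of the three upper bounds on the three terms I, II, and III will be used as an upper bound on $c$.

Bounding Term I. First, we want to find an upper bound for

$$\sum_{\sigma, \sigma_{[i,j]} : (\tau, \xi) \in \gamma_{\sigma, \sigma_{[i,j]}}} |\gamma_{\sigma, \sigma_{[i,j]}}| \tag{17}$$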
for the above-defined paths and for any edge $(\tau, \xi)$ in the graph of $\bar{P}$. Let us start with finding an upper bound for the length $|\gamma_{\sigma, \sigma_{[i,j]}}|$ of a path. Since $i \in \{0, \dots, N\}$, stages 2 and 3 can each consist of at most $N$ swapping moves. Stages 1 and 4 each contain only one transition. Thus,

$$|\gamma_{\sigma, \sigma_{[i,j]}}| \le 2N + 2 = 2(N+1).$$

We will now upper bound the number of paths that go through any edge $(\tau, \xi)$. If the edge corresponds to stage 1 of the path, then $\tau = \sigma$ and $\xi = \sigma_{[0,j]}$, and since $i \in \{0, \dots, N\}$ there are not more than $N + 1$ paths containing this edge. A similar argument holds for edges in stage 4. If the edge is in stage 2 of the path, then $\tau = (\sigma_1, \dots, \sigma_l, j, \sigma_{l+1}, \dots, \sigma_N)$ for some $l \in \{0, \dots, i-1\}$ and $\xi = (\sigma_1, \dots, \sigma_l, \sigma_{l+1}, j, \dots, \sigma_N)$. Here, $\sigma_0$ is unknown and has $2^n$ possible values. With $i \in \{0, \dots, N\}$ there are not more than $2^n(N+1)$ paths containing an edge in stage 2 of the path. Similarly, there are no more than $2^n(N+1)$ paths containing a certain edge in stage 3 of the path. Since the same edge could either be found in stage 1 or 4 of the path or in stage 2 or 3, each edge can be in two stages, each corresponding to at most $2^n(N+1)$ paths. Therefore, the total number of paths containing a single edge is no more than $2^{n+1}(N+1) = 2\max\{N+1, 2^n(N+1)\}$ and thus

$$\sum_{\sigma, \sigma_{[i,j]} : (\tau, \xi) \in \gamma_{\sigma, \sigma_{[i,j]}}} |\gamma_{\sigma, \sigma_{[i,j]}}| \le 2^{n+2}(N+1)^2. \tag{18}$$

Bounding Term II. For bounding $\bar{\pi}(\sigma)/\bar{\pi}(\tau)$ first note that any state $\tau$ in the stages 1 or 2 of the path from $\sigma$ to $\sigma_{[i,j]}$ as given above is of the form $\tau = (\sigma_1, \dots, \sigma_l, j, \sigma_{l+1}, \dots, \sigma_N)$ for some $l \in \{0, \dots, i\}$. Therefore,

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} = \left[\prod_{k=1}^{l} \frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right] \frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)}. \tag{19}$$

For the Boltzmann distribution of RBMs we have

$$\frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)} = \frac{Z_l}{Z_0} \, \frac{\sum_v \exp(-E(v, h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v, h_j)\beta_l)}$$

and for the first term on the right side of equation (19)

$$\prod_{k=1}^{l} \frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})} = \frac{Z_0}{Z_l} \prod_{k=1}^{l} \frac{\sum_v \exp(-E(v, h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v, h_{\sigma_k})\beta_{k-1})}.$$
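Now consider that

$$\frac{\sum_v \exp(-E(v, h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v, h_{\sigma_k})\beta_{k-1})} \le \frac{\sum_v \exp(-\min_h E(v, h)\beta_k)}{\sum_v \exp(-\min_h E(v, h)\beta_{k-1})}, \tag{20}$$

because we can write

$$\frac{\sum_v \exp(-E(v, h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v, h_{\sigma_k})\beta_{k-1})} = \frac{\sum_v \exp(-E(v, h_{\sigma_k})\beta_k + \max_{v,h} E(v, h))}{\sum_v \exp(-E(v, h_{\sigma_k})\beta_{k-1} + \max_{v,h} E(v, h))},$$

which makes all arguments of the exponential function in the terms of denominator and numerator nonnegative. The function

$$\frac{R_k + \exp(-x\beta_k + y)}{R_{k-1} + \exp(-x\beta_{k-1} + y)}$$

is monotonically decreasing in $x$ for $x \le y$ and $\beta_k \ge \beta_{k-1} \ge 0$, and for each value of $v$ we can write $x = E(v, h_{\sigma_k})$ and $y = \max_{v,h} E(v, h)$ and fix $R_k$ and $R_{k-1}$ to the remaining terms in numerator and denominator, respectively. Thus, the expression gets maximal if we replace $x$ by $\min_h E(v, h)$. This can be done for all values of $v$. So we can write:

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} \le \frac{Z_0}{Z_l} \left[\prod_{k=1}^{l} \frac{\sum_v \exp(-\min_h E(v, h)\beta_k)}{\sum_v \exp(-\min_h E(v, h)\beta_{k-1})}\right] \frac{Z_l}{Z_0} \, \frac{\sum_v \exp(-E(v, h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v, h_j)\beta_l)}$$
$$= \frac{\sum_v \exp(-\min_h E(v, h)\beta_l)}{\sum_v \exp(-\min_h E(v, h)\beta_0)} \cdot \frac{\sum_v \exp(-E(v, h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v, h_j)\beta_l)}$$
$$\overset{\beta_0 = 0}{=} \frac{\sum_v \exp(-\min_h E(v, h)\beta_l)}{\sum_v \exp(-E(v, h_j)\beta_l)} \le \frac{\sum_v \exp(-\min_h E(v, h))}{\sum_v \exp(-\max_h E(v, h))}.$$

Any state $\tau$ in stage 3 of the path is of the form

$$\tau = (\sigma_1, \dots, \sigma_l, \sigma_i, \sigma_{l+1}, \dots, \sigma_{i-1}, j, \sigma_{i+1}, \dots, \sigma_N)$$

for some $l \in \{0, \dots, i-1\}$. Thus,

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} = \left[\prod_{k=1}^{l} \frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right] \frac{\pi_0(A_{\sigma_0}) \, \pi_i(A_{\sigma_i})}{\pi_i(A_j) \, \pi_l(A_{\sigma_i})}.$$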
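In an analogous way as above we get

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} \le \frac{\sum_v \exp(-\min_h E(v, h)\beta_l)}{\sum_v \exp(-\min_h E(v, h)\beta_0)} \cdot \frac{\sum_v \exp(-\min_h E(v, h)\beta_0)}{\sum_v \exp(-E(v, h_j)\beta_i)} \cdot \frac{\sum_v \exp(-\min_h E(v, h)\beta_i)}{\sum_v \exp(-\min_h E(v, h)\beta_l)}$$
$$= \frac{\sum_v \exp(-\min_h E(v, h)\beta_i)}{\sum_v \exp(-E(v, h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v, h))}{\sum_v \exp(-\max_h E(v, h))}.$$

Any state in stage 4 is given by $\tau = (\sigma_i, \sigma_1, \dots, \sigma_{i-1}, j, \sigma_{i+1}, \dots, \sigma_N)$. Therefore,

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} = \frac{\pi_0(A_{\sigma_0}) \, \pi_i(A_{\sigma_i})}{\pi_0(A_{\sigma_i}) \, \pi_i(A_j)} = \frac{\sum_v \exp(-E(v, h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v, h_{\sigma_i})\beta_0)} \cdot \frac{\sum_v \exp(-E(v, h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v, h_j)\beta_i)}$$
$$\overset{\beta_0 = 0}{=} \frac{\sum_v \exp(-E(v, h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v, h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v, h))}{\sum_v \exp(-\max_h E(v, h))}.$$

Thus, by now we have shown that for all states $\tau$ in the path we have

$$\frac{\bar{\pi}(\sigma)}{\bar{\pi}(\tau)} \le \frac{\sum_v \exp(-\min_h E(v, h))}{\sum_v \exp(-\max_h E(v, h))}. \tag{21}$$

Bounding Term III. Now we will bound the remaining term $\bar{T}(\sigma, \sigma_{[i,j]}) / \bar{P}(\tau, \xi)$. We have

$$\bar{T}(\sigma, \sigma_{[i,j]}) = \frac{1}{N+1} \pi_i(A_j) \le \frac{1}{N+1} \cdot \frac{\sum_v \exp(-\min_h E(v, h))}{2^n \sum_v \exp(-\max_h E(v, h))}. \tag{22}$$

For bounding $\bar{P}(\tau, \xi)$, we consider two cases. Any edge $(\tau, \xi)$ on the path $\gamma_{\sigma, \sigma_{[i,j]}}$ can be one of the following two types:

Case 1. It is an edge where we get $\xi$ from $\tau = (\tau_0, \dots, \tau_N)$ by replacing $\tau_0$ by any new state $\tau'_0$ and, thus, $\xi = \tau_{[0, \tau'_0]} = (\tau'_0, \tau_1, \dots, \tau_N)$.
Case 2. It is an edge which we obtain by swapping two elements $\tau_t$ and $\tau_{t+1}$ and, thus, $\xi = (\tau_0, \dots, \tau_{t+1}, \tau_t, \dots, \tau_N)$.

Let us analyse these two cases separately: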
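Case 1: Based on (16),

$$\bar{P}(\tau, \xi) = \frac{1}{\pi_{\mathrm{PT}}(\Omega_\tau)} \sum_{x \in \Omega_\tau} \sum_{y \in \Omega_\xi} \pi_{\mathrm{PT}}(x) P(x, y),$$

and by

$$P(x, y) \ge Q(x, x) \, T(x, y) \, Q(x, x) \ge \frac{1}{4} T(x, y),$$

we get

$$\bar{P}(\tau, \xi) \ge \frac{1}{4 \pi_{\mathrm{PT}}(\Omega_\tau)} \sum_{x \in \Omega_\tau} \pi_{\mathrm{PT}}(x) \sum_{y \in \Omega_\xi} T(x, y).$$

Because we consider the first case, we can restrict the inner sum to those $y$ that differ from $x$ only in the state of chain 0, since $T(x, y) = 0$ otherwise. That is, for $x = (x_{[0]}, x_{[1]}, \dots, x_{[N]})$ we restrict the inner sum to all $y = (y_{[0]}, x_{[1]}, \dots, x_{[N]})$ with $y_{[0]} \in A_{\xi_0}$. In this case we have

$$T(x, y) = \frac{1}{2(N+1)} T_0(x_{[0]}, y_{[0]}).$$

Furthermore, for $x_{[0]} = (v_{[0]}, h_{[0]})$ there is only one $y_{[0]} \in A_{\xi_0}$ that can be reached by one transition of $T_0$, namely $y_{[0]} = (v_{[0]}, h_{\xi_0})$, with probability $T_0(x_{[0]}, y_{[0]}) = \frac{1}{2^{n+1}}$ for $\beta_0 = 0$ (probability of $\frac{1}{2^n}$ to sample a new state of the $n$ hidden variables, which are uniformly distributed). And thus

$$\bar{P}(\tau, \xi) \ge \frac{1}{4 \pi_{\mathrm{PT}}(\Omega_\tau)} \sum_{x \in \Omega_\tau} \pi_{\mathrm{PT}}(x) \frac{1}{4(N+1) 2^n} = \frac{1}{2^n \, 16(N+1)},$$

where we use $\sum_{x \in \Omega_\tau} \pi_{\mathrm{PT}}(x) = \pi_{\mathrm{PT}}(\Omega_\tau)$. Thus, we get

$$\frac{\bar{T}(\sigma, \sigma_{[i,j]})}{\bar{P}(\tau, \xi)} \le 16 \cdot 2^n (N+1) \, \bar{T}(\sigma, \sigma_{[i,j]}).$$

Using (22) this is bounded from above by

$$2^4 \, \frac{\sum_v \exp(-\min_h E(v, h))}{\sum_v \exp(-\max_h E(v, h))}. \tag{23}$$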
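Case 2: For bounding $\bar{P}(\tau, \xi)$ in the second case, we can make use of $P(x, y) \ge \frac{1}{2} Q(x, y)$, because $P = QTQ$ and the swap can either occur under the first or the second $Q$, and the probability of holding under $T$ and the other $Q$ is $\frac{1}{4}$. Thus, we get

$$\bar{P}(\tau, \xi) \ge \frac{1}{2 \pi_{\mathrm{PT}}(\Omega_\tau)} \sum_{x \in \Omega_\tau} \sum_{y \in \Omega_\xi} \pi_{\mathrm{PT}}(x) Q(x, y) = \frac{1}{2 \prod_{t=0}^{N} \pi_t(A_{\tau_t})} \sum_{x \in \Omega_\tau} \sum_{y \in \Omega_\xi} \prod_{t=0}^{N} \pi_t(x_{[t]}) \, Q(x, y)$$
$$= \frac{1}{4N \, \pi_t(A_{\tau_t}) \, \pi_{t+1}(A_{\tau_{t+1}})} \sum_{x_{[t]} \in A_{\tau_t}} \sum_{x_{[t+1]} \in A_{\tau_{t+1}}} \pi_t(x_{[t]}) \, \pi_{t+1}(x_{[t+1]}) \min\left\{1, \frac{\pi_{t+1}(x_{[t]}) \, \pi_t(x_{[t+1]})}{\pi_t(x_{[t]}) \, \pi_{t+1}(x_{[t+1]})}\right\},$$

where in the last step we made use of the facts that $\tau_i = \xi_i$ for $i \in \{0, \dots, t-1, t+2, \dots, N\}$ and that the probabilities to propose and accept a certain swap according to $Q$ are $\frac{1}{2N}$ and $\min\left\{1, \frac{\pi_{t+1}(x_{[t]}) \pi_t(x_{[t+1]})}{\pi_t(x_{[t]}) \pi_{t+1}(x_{[t+1]})}\right\}$, respectively. We further have

$$\sum_{x_{[t]} \in A_{\tau_t}} \sum_{x_{[t+1]} \in A_{\tau_{t+1}}} \pi_t(x_{[t]}) \, \pi_{t+1}(x_{[t+1]}) \min\left\{1, \frac{\pi_{t+1}(x_{[t]}) \, \pi_t(x_{[t+1]})}{\pi_t(x_{[t]}) \, \pi_{t+1}(x_{[t+1]})}\right\}$$
$$= \frac{1}{Z_t Z_{t+1}} \sum_v \sum_{\hat{v}} \exp(-E(v, h_{\tau_t})\beta_t) \exp(-E(\hat{v}, h_{\tau_{t+1}})\beta_{t+1}) \min\left\{1, \frac{\exp(-E(v, h_{\tau_t})\beta_{t+1}) \exp(-E(\hat{v}, h_{\tau_{t+1}})\beta_t)}{\exp(-E(v, h_{\tau_t})\beta_t) \exp(-E(\hat{v}, h_{\tau_{t+1}})\beta_{t+1})}\right\}$$
$$= \frac{1}{Z_t Z_{t+1}} \sum_v \sum_{\hat{v}} \exp(-E(v, h_{\tau_t})\beta_t) \exp(-E(\hat{v}, h_{\tau_{t+1}})\beta_{t+1}) \min\left\{1, \exp\big(-(E(v, h_{\tau_t}) - E(\hat{v}, h_{\tau_{t+1}}))(\beta_{t+1} - \beta_t)\big)\right\}$$
$$\ge \frac{1}{Z_t Z_{t+1}} \exp\Big(-\big(\max_{v,h} E(v, h) - \min_{v,h} E(v, h)\big)(\beta_{t+1} - \beta_t)\Big) \sum_v \exp(-E(v, h_{\tau_t})\beta_t) \sum_{\hat{v}} \exp(-E(\hat{v}, h_{\tau_{t+1}})\beta_{t+1})$$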
21 t (A [t] ) t+ (A [t+] ) = Z t Z t+ ( v exp( E(v, h [t] ) t ))( ˆv exp( E(ˆv, h [t+] ) t+ )) and thus a bound of (, ) isgivenby (, ) 4N exp max t ( t+ t )(max v,h E(v, h)) min E(v, h)). v,h Using max v,h E(v, h) min E(v, h)) apple, v,h with = i,j w ij + j b j + i c i summing the absolute values of parameters of the RBM, we get (, ) exp( max t ( t+ t ) ) 4N. (4) From (4) and the upper bound for T (, [i,j]) givenin()itfollows in the current case that T (, [i,j]) 4 apple v exp( min h E(v, h)) (, ) n exp( max t ( t+ t ) ) v exp( max h E(v, h)). (5) Combining Terms I, II, III. By joining the results (3) and (5) for term III with the upper bound for term II given in () we get in case ( )T (, [i,j]) ( ) (, ) apple 4 ( v exp( min h E(v, h)) ( v exp( max h E(v, h))) (6) and in case ( )T (, [i,j]) ( ) (, ) 4 ( v apple exp( min h E(v, h))) n exp( max t ( t+ t ) ) ( v exp( max h E(v, h))). (7)
22 If we combine these results for the two cases (6) and (7) we get = 4( v exp( min h E(v, h)) ( v exp( max h E(v, h))) max 4 ( v apple max exp( min h E(v, h)) ( v exp( max h E(v, h))), ( )T (, [i,j]) ( ) (, ) 4 ( v exp( min h E(v, h))) n exp( max t ( t+ t ) ) ( v exp( max h E(v, h))) 4, n exp( max t ( t+ t ) ). (8) utting this and (8) together, we arrive at an upper bound for the constant c from theorem 3: c apple 6(N +) n ( v exp( min h E(v, h))) ( v exp( max h E(v, h))) max, n exp( max t ( t+ t ) ) = 6(N +) ( v exp( min h E(v, h))) ( v exp( max h E(v, h))) max n+, exp( max t ( t+ t ) ) Thus, applying theorem 3 and theorem 5, from which follows that Gap(T )= (N +) because we have b 0 = b = = b N =/(N +) and all component chains of the product chain T have a spectral gap of, we get: Gap( ) c Gap(T )= ( v exp( max h E(v, h))) 6(N +) 3 ( v exp( min h E(v, h))) min, exp( max( n+ t+ t ) ). (9) t
23 5.5. Step 5: utting it all together Using (9) and (5) in () leads to Gap( ) Gap( )min Gap( ) ( v exp( max h E(v, h))) 8 (N +) 4 ( v exp( min h E(v, h))) min, exp( max( n+ t+ t ) ), t which we further bound by making use of the factorisation of the marginal distribution of the hidden RBM variables. As can, for example, be seen from equation (0) in [3], the marginal distribution of the hidden units can be written as (h) = Z v exp( E(v, h)) = Q n Z i= exp(c ih i ) Q m j= (+exp(b j+ n i= w ijh i )). This leads to v exp( max h E(v, h))) v exp( min h E(v, h)) Q n i= apple exp(min Q m h i c i h i ) j= Q ( + exp(b j + n i= min h i w ij h i )) n i= exp(max Q h i c i h i ) m j= ( + exp(b j + n i= max h i w ij h i )) n my ( + exp(b j + n i= =exp( c i ) w iji [,0] (w ij ))) ( + exp(b i j= j + n i= w iji [0,+] (w ij ))) my =exp( c) f j. with c = n i c i and f j = ( + exp(b j + n i= w iji [,0] (w ij ))) ( + exp(b j + n i= w iji [0,+] (w ij ))) = sig( b n j i= w iji [0,+] (w ij )) sig( b n j i= w iji [,0] (w ij )) = min h (v j = h) max h (v j = h), where sig(x) =. Therefore, f +exp( x) j is just the fraction of the minimal and maximal activation of the jth visible unit of the RBM. j= 3
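The closed form of f_j can be checked numerically. The following Python sketch is illustrative only. It assumes that the two indicator functions in the definition of f_j select the non-positive and the non-negative weights, respectively; the function names and the parameter values are hypothetical. It compares the closed form with the sigmoid ratio and with a brute-force minimum and maximum of sig(-b_j - sum_i w_ij h_i) over all binary hidden states.

```python
import itertools
import math


def sig(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def f_closed_form(b_j, w_col):
    """Ratio (1 + exp(b_j + sum of non-positive weights))
    over (1 + exp(b_j + sum of non-negative weights))."""
    neg = sum(w for w in w_col if w <= 0)
    pos = sum(w for w in w_col if w >= 0)
    return (1.0 + math.exp(b_j + neg)) / (1.0 + math.exp(b_j + pos))


def f_sigmoid_form(b_j, w_col):
    """The same quantity written as a ratio of two sigmoids."""
    neg = sum(w for w in w_col if w <= 0)
    pos = sum(w for w in w_col if w >= 0)
    return sig(-b_j - pos) / sig(-b_j - neg)


def f_brute_force(b_j, w_col):
    """Minimum over maximum of sig(-b_j - sum_i w_ij h_i),
    enumerated over all binary hidden states h."""
    values = [
        sig(-b_j - sum(w * h for w, h in zip(w_col, hs)))
        for hs in itertools.product((0, 1), repeat=len(w_col))
    ]
    return min(values) / max(values)


if __name__ == "__main__":
    # Hypothetical bias of visible unit j and its incoming weights w_ij.
    b_j, w_col = 0.3, [1.2, -0.7, 0.0, 2.1]
    print(f_closed_form(b_j, w_col))
    print(f_sigmoid_form(b_j, w_col))
    print(f_brute_force(b_j, w_col))
```

All three functions should agree up to floating-point error. The value lies in (0, 1] and shrinks as the incoming weights grow in magnitude, which is the regime in which the bound becomes weaker.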
Thus, we get Gap( ) exp( c) 8 (N +) 4 exp( c) =min n+0 (N +) 4 my fj min, exp( max( n+ t+ t ) ) t j= my fj, exp( c max t ( t+ t ) ) my f 8 (N +) 4 j. j= It is possible to repeat the proof while changing the roles of hidden and visible variables, which finally leads to (6) and proves our main result.

6. Discussion

Our main theorem and its corollary are the first non-trivial results on the convergence rate of PT for sampling from binary RBMs. As the corollary shows, they lead to an upper bound on the approximation error when the analyzed PT method is applied to gradient estimation in RBM training. The upper bound depends exponentially on the size of one layer and on the absolute values of the RBM parameters. The fewer the nodes and the smaller the magnitudes of the parameters, the faster the convergence. This intuitive result resembles the bounds on the approximation bias in contrastive divergence learning [7]. Our current hypothesis is that RBM PT chains are in general not rapidly mixing, and that the exponential dependencies in the RBM sampling complexity cannot in general be avoided. This is the case for some related models such as variants of the Potts model [6, ].

Because our analysis considers the convergence to the stationary distribution of the product chain consisting of all replica chains, we get an undesired additional linear dependency on the number of replicas. However, such a dependency seems to be inevitable for this type of analysis, and we are not aware of any approach bounding the original chain.

The terms $f_j$, $j = 1, \dots, m$, and $g_i$, $i = 1, \dots, n$, are the ratios of the minimal to the maximal activation probability of visible neuron $j$ and hidden neuron $i$, respectively, given the states of the other layer. They measure how strongly the neurons vary with changing input. The observed dependency is in accordance with the expectation that highly variable chains require more samples to reach the stationary distribution.
Furthermore, the minimal and maximal activation probabilities indicate how close the conditional distributions are to being uniform. Gibbs sampling relies on these conditional distributions. It is optimal if all neurons are activated with probability 0.5 (in this case it reaches the stationary distribution in only one sampling step) and becomes more and more deterministic the closer these probabilities get to zero or one. Thus, mixing slows down, as our bound indicates.

When inspecting the role of the inverse temperatures in (6), we see that the term $\max_t (\beta_{t+1} - \beta_t)$ becomes minimal if $\beta_0, \dots, \beta_N$ are spaced uniformly between 0 and 1. In this case $\max_t (\beta_{t+1} - \beta_t) = 1/N$. This theoretical result is in accordance with the heuristic choice of a uniform inverse temperature spacing used in many applications (e.g., see [9, ]). So far, to our knowledge, the only theoretically justified heuristic for spacing the inverse temperatures in the tempered transitions literature is to choose $\beta_0, \dots, \beta_N$ geometrically (i.e., $\beta_{i+1}/\beta_i = \mathrm{const}$) [23, 24]. Geometric temperature spacing has been shown to be optimal for sampling from Gaussian distributions by Neal [23], and this result has been extended to a more general class of distributions by Behrens et al. [24]. However, RBMs do not fall into this class, and a geometric spacing is sub-optimal for sampling RBMs [5]. Desjardins et al. suggest an adaptation scheme that dynamically adjusts the inverse temperatures during RBM training (starting from a uniform spacing). In their experiments, the higher inverse temperatures are adapted to values that are much closer together than a geometric spacing would suggest [25], which also supports a uniform rather than a geometric static inverse temperature spacing.

We regard the PT variant we analyzed as the typical algorithm considered in theoretical analyses of PT, for instance by Madras and Zheng [14] and Woodard et al. []. Still, it differs from the PT variant most often used for RBMs in practice.
However, as mentioned in section.4, the transition operator of this PT variant is not reversible, for several reasons. Firstly, it consists of an update step in which blockwise Gibbs sampling is performed in all tempered chains. This is not reversible due to the fixed order of first sampling $h^{k+1}$ based on $p(h \mid v^{k})$ followed by sampling $v^{k+1}$ based on $p(v \mid h^{k+1})$, where the superscripts denote the sampling steps. Secondly, the swapping is usually organized in two substeps, where even temperatures are considered in the first and odd temperatures in the second substep. And, last but not least, we have $P = TQ$ instead of $P = QTQ$. Lacking reversibility prohibits our type of analysis, and we are not aware of other strategies that do not require this property.

Of course, one can design PT operators that are closer to the one used for RBM sampling in practice but still reversible. For example, we can define $P = QTQ$, choose $T$ to perform an update of all visible or all hidden variables (each picked with probability 0.5) in all tempered chains simultaneously, and let $Q$ try to swap, with equal probability, either all even or all odd chains in parallel. We have adapted our analysis to this PT variant. However, the resulting bound was actually less tight (among other reasons because $T$ is no longer defined as a product chain, but is of the form $T = \prod_{t=0}^{N} T_t$, so that theorem 5 cannot be applied), and the proof did not provide additional insights.

7. Conclusion

Parallel tempering (PT) is state-of-the-art for sampling restricted Boltzmann machines (RBMs). We presented a first analysis of the convergence rate of the Markov chains of the PT algorithm as considered by Madras and Zheng [14] and Woodard et al. [] for sampling RBMs by deriving an arguably loose, but non-trivial bound on the spectral gap. We find an exponential dependency on the maximum size of the two layers and the absolute values of the RBM parameters, in accordance with existing bounds on the approximation bias in contrastive divergence learning. We hypothesise that RBM PT chains are in general not rapidly mixing, and that one cannot in general get rid of the exponential dependencies on the number of variables in the RBM sampling complexity. We regard proving conditions under which RBMs are torpid mixing as an interesting question for future research.

Our bound on the spectral gap is optimized if the inverse temperatures are spaced uniformly, which is a common choice in practice. We do not claim that this spacing is optimal; in particular, an adaptive spacing seems to be a good idea. Still, the result matches the practical experience that a uniform spacing (either fixed or as a starting point for adaptation) is preferable to a geometric one when using PT for sampling RBMs. The bound on the convergence rate naturally leads to a bound on the gradient approximation error of the method when used during RBM training, which resembles bounds on the approximation error of contrastive divergence learning [7].
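The role of the temperature spacing can be illustrated with a few lines of Python. This is a minimal sketch rather than part of the analysis: the function names are hypothetical, and because a geometric ladder cannot contain the inverse temperature 0, it is assumed here to start at a small positive value `beta_min`. The sketch compares the largest gap between neighbouring inverse temperatures, which is the quantity that enters the bound, for a uniform and a geometric ladder with the same number of replicas.

```python
def uniform_ladder(n_gaps):
    """Inverse temperatures 0, 1/N, ..., 1 with N = n_gaps."""
    return [t / n_gaps for t in range(n_gaps + 1)]


def geometric_ladder(n_gaps, beta_min):
    """Inverse temperatures with a constant ratio between neighbours,
    running from beta_min (assumed > 0) up to 1."""
    ratio = (1.0 / beta_min) ** (1.0 / n_gaps)
    return [beta_min * ratio ** t for t in range(n_gaps + 1)]


def max_gap(betas):
    """Largest difference between neighbouring inverse temperatures."""
    return max(hi - lo for lo, hi in zip(betas, betas[1:]))


if __name__ == "__main__":
    N = 10
    print("uniform  :", max_gap(uniform_ladder(N)))
    print("geometric:", max_gap(geometric_ladder(N, beta_min=0.01)))
```

For any ladder from 0 to 1 with N gaps, the gaps sum to 1, so the largest gap is at least 1/N. The uniform ladder attains this value, whereas a geometric ladder concentrates its largest gap next to inverse temperature 1.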
References

[1] P. Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, in: D. E. Rumelhart, J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. : Foundations, MIT Press, 94 8, 986.
[2] G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation 4 (00)
[3] A. Fischer, C. Igel, Training Restricted Boltzmann Machines: An Introduction, Pattern Recognition 47 (03)
[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 33 (5786) (006)
[5] A. Fischer, C. Igel, Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines, in: K. Diamantaras, W. Duch, L. S. Iliadis (Eds.), International Conference on Artificial Neural Networks (ICANN 00), vol of LNCS, Springer-Verlag, 08 7, 00.
[6] Y. Bengio, O. Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation (6) (009)
[7] A. Fischer, C. Igel, Bounding the Bias of Contrastive Divergence Learning, Neural Computation 3 (0)
[8] C. J. Geyer, Markov chain Monte Carlo maximum likelihood, in: E. Kerami (Ed.), Proceedings of the 3rd Symposium on the Interface of Computing Science and Statistics, Interface Foundation of North America, 56 63, 99.
[9] R. Salakhutdinov, Learning in Markov Random Fields using Tempered Transitions, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems (NIPS ), , 009.
[10] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, O. Delalleau, Parallel Tempering for Training of Restricted Boltzmann Machines, Journal of Machine Learning Research Workshop and Conference Proceedings 9 (AISTATS 00) (00)
[11] K. Cho, T. Raiko, A. Ilin, Parallel tempering is efficient for learning restricted Boltzmann machines, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN 00), IEEE Press, , 00.
[12] D. B. Woodard, S. C. Schmidler, M. Huber, Conditions for Rapid Mixing of Parallel and Simulated Tempering on Multimodal Distributions, The Annals of Applied Probability 9 (009)
[13] P. Diaconis, L. Saloff-Coste, What do we know about the Metropolis algorithm?, Journal of Computer and System Sciences 57 (998)
[14] N. Madras, Z. Zheng, On the swapping algorithm, Random Structures & Algorithms (003)
[15] K. Brügge, A. Fischer, C. Igel, The flip-the-state transition operator for Restricted Boltzmann Machines, Machine Learning 3 (03)
[16] N. Bhatnagar, D. Randall, Torpid Mixing of Simulated Tempering on the Potts Model, in: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, , 004.
[17] S. Caracciolo, A. Pelissetto, A. D. Sokal, Two remarks on simulated tempering, Unpublished manuscript.
[18] N. Madras, D. Randall, Markov chain decomposition for convergence rate analysis, The Annals of Applied Probability (58 606).
[19] P. Diaconis, L. Saloff-Coste, Comparison theorems for reversible Markov chains, The Annals of Applied Probability 3 (993)
[20] P. Brémaud, Markov chains: Gibbs fields, Monte Carlo simulation, and queues, Springer-Verlag, 999.
[21] P. Diaconis, L. Saloff-Coste, Logarithmic Sobolev inequalities for finite Markov chains, The Annals of Applied Probability 6 (996)
[22] D. Woodard, S. Schmidler, M. Huber, Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering, Electronic Journal of Probability 4 (009)
[23] R. Neal, Sampling from multimodal distributions using tempered transitions, Statistics and Computing 6 (996)
[24] G. Behrens, N. Friel, M. Hurn, Tuning tempered transitions, Statistics and Computing () (0)
[25] G. Desjardins, A. Courville, Y. Bengio, Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs, in: NIPS 00 Workshop on Deep Learning and Unsupervised Feature Learning, 00.

Appendix A. Bounding the bias of an MCMC estimate in terms of the convergence rate

The convergence rate of a Markov chain can be used to bound the bias of an approximation of the expected value of a function under the model distribution based on samples from the chain. This is demonstrated in the following. Consider a Markov chain on the state space $\Omega$. Let $\pi$ be the stationary distribution of the chain and $P^k(z, \cdot)$ be the distribution obtained by running the Markov chain for $k$ steps starting from a fixed sample $z$. Furthermore, let $t : \Omega \to \mathbb{R}$ be an arbitrary function. Then a bound on the bias of an estimate that approximates the expected value of $t(x)$ under $\pi$ based on samples from $P^k(z, \cdot)$ is given by

$$
\Big| \sum_{x} \pi(x)\, t(x) - \sum_{x} P^k(z, x)\, t(x) \Big|
= \Big| \sum_{x} \big( \pi(x) - P^k(z, x) \big)\, t(x) \Big|
\le \sum_{x} \big| \big( \pi(x) - P^k(z, x) \big)\, t(x) \big|
\le \max_{x} |t(x)| \sum_{x} \big| \pi(x) - P^k(z, x) \big|
= \max_{x} |t(x)| \; \big\| P^k(z, \cdot) - \pi \big\| ,
$$

where $\| P^k(z, \cdot) - \pi \|$ is the variation distance, which can be bounded, for example, using equation (3).
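The inequality of Appendix A can be checked on a toy example. The sketch below is an illustration under stated assumptions, not part of the original analysis: the three-state transition matrix `P`, its stationary distribution `PI`, the test function values `T_VALUES`, and all function names are hypothetical, and the variation distance is taken as the plain sum of absolute differences, as in the derivation above.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def k_step_distribution(P, z, k):
    """Distribution after k steps when starting in the fixed state z."""
    dist = [1.0 if i == z else 0.0 for i in range(len(P))]
    for _ in range(k):
        dist = step(dist, P)
    return dist


def bias(pi, dist, t):
    """Absolute difference of the expectations of t under pi and dist."""
    return abs(sum((p - q) * tx for p, q, tx in zip(pi, dist, t)))


def bound(pi, dist, t):
    """max |t(x)| times the summed absolute difference of pi and dist."""
    l1 = sum(abs(p - q) for p, q in zip(pi, dist))
    return max(abs(tx) for tx in t) * l1


# Hypothetical reversible toy chain with stationary distribution PI.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
PI = [0.25, 0.5, 0.25]
T_VALUES = [1.0, -2.0, 0.5]

if __name__ == "__main__":
    for k in range(6):
        d = k_step_distribution(P, 0, k)
        print(k, bias(PI, d, T_VALUES), bound(PI, d, T_VALUES))
```

In every row the bias should stay below the bound, and both shrink as k grows and the chain approaches its stationary distribution.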
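Appendix B. Relation between the convergence rates of the PT product chain and the original chain

In the following, we prove that bounding the convergence rate of the PT product chain also bounds the convergence rate of the original chain (i.e., the chain with inverse temperature $\beta_N = 1$). Assume that for the product chain

$$
\sum_{x} \Big| P^k(y, x) - \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big| < \epsilon
$$

holds for an arbitrary starting point $y$. Then we can write

$$
\sum_{x_{[N]}} \sum_{x_{[-N]}} \Big| P^k(y, x) - \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big| < \epsilon ,
$$

which implies

$$
\epsilon > \sum_{x_{[N]}} \sum_{x_{[-N]}} \Big| P^k(y, x) - \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big|
\ge \sum_{x_{[N]}} \Big| \sum_{x_{[-N]}} \Big( P^k(y, x) - \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big) \Big| .
$$

Thus, $\epsilon$ also upper bounds the variation distance for the original chain at temperature 1:

$$
\begin{aligned}
\sum_{x_{[N]}} \Big| \sum_{x_{[-N]}} \Big( P^k(y, x) - \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big) \Big|
&= \sum_{x_{[N]}} \Big| \sum_{x_{[-N]}} P^k(y, x) - \sum_{x_{[-N]}} \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big| \\
&= \sum_{x_{[N]}} \Big| P^k_N(y, x_{[N]}) - \sum_{x_{[-N]}} \prod_{t=0}^{N} \pi_t(x_{[t]}) \Big| \\
&= \sum_{x_{[N]}} \Big| P^k_N(y, x_{[N]}) - \pi_N(x_{[N]}) \sum_{x_{[-N]}} \prod_{t=0}^{N-1} \pi_t(x_{[t]}) \Big| \\
&= \sum_{x_{[N]}} \Big| P^k_N(y, x_{[N]}) - \pi_N(x_{[N]}) \Big| < \epsilon .
\end{aligned}
$$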
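The key step of Appendix B is that summing out the states of all other replicas cannot increase the summed absolute difference between two distributions. The following Python sketch checks this on small random examples. It is only an illustration: the helper names are hypothetical, and both distributions are arbitrary random joints over a few discrete components, which is more general than the product form considered in the appendix.

```python
import itertools
import random


def random_joint(shape, rng):
    """Random probability table over the product of discrete components."""
    states = itertools.product(*(range(n) for n in shape))
    weights = {s: rng.random() for s in states}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def l1_distance(p, q):
    """Summed absolute difference between two tables over the same states."""
    return sum(abs(p[s] - q[s]) for s in p)


def marginal_last(p):
    """Marginal of the last component, obtained by summing out all others."""
    out = {}
    for s, prob in p.items():
        out[s[-1]] = out.get(s[-1], 0.0) + prob
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    p = random_joint((2, 3, 4), rng)
    q = random_joint((2, 3, 4), rng)
    print("joint distance   :", l1_distance(p, q))
    print("marginal distance:", l1_distance(marginal_last(p), marginal_last(q)))
```

The marginal distance should never exceed the joint distance, which is why a bound for the product chain carries over to the chain at inverse temperature 1.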
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationTempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines
Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines Desjardins, G., Courville, A., Bengio, Y., Vincent, P. and Delalleau, O. Technical Report 1345, Dept. IRO, Université de
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationarxiv: v1 [cs.lg] 8 Oct 2015
Empirical Analysis of Sampling Based Estimators for Evaluating RBMs Vidyadhar Upadhya and P. S. Sastry arxiv:1510.02255v1 [cs.lg] 8 Oct 2015 Indian Institute of Science, Bangalore, India Abstract. The
More informationExponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that
1 More examples 1.1 Exponential families under conditioning Exponential families also behave nicely under conditioning. Specifically, suppose we write η = η 1, η 2 R k R p k so that dp η dm 0 = e ηt 1
More informationCS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling
CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy
More informationMarkov Chain Monte Carlo The Metropolis-Hastings Algorithm
Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability
More information6 Markov Chain Monte Carlo (MCMC)
6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationSampling Methods (11/30/04)
CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with
More informationMachine Learning. Probabilistic KNN.
Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007
More informationReading Group on Deep Learning Session 4 Unsupervised Neural Networks
Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann
More informationCS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash
CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness
More informationLearning and Evaluating Boltzmann Machines
Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Ruslan Salakhutdinov 2008. June 26, 2008
More informationARestricted Boltzmann machine (RBM) [1] is a probabilistic
1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1
More informationMarkov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018
Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationModeling Documents with a Deep Boltzmann Machine
Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava, Ruslan Salakhutdinov & Geoffrey Hinton UAI 2013 Presented by Zhe Gan, Duke University November 14, 2014 1 / 15 Outline Replicated Softmax
More informationAnnealing Between Distributions by Averaging Moments
Annealing Between Distributions by Averaging Moments Chris J. Maddison Dept. of Comp. Sci. University of Toronto Roger Grosse CSAIL MIT Ruslan Salakhutdinov University of Toronto Partition Functions We
More informationarxiv: v2 [cs.ne] 22 Feb 2013
Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More informationUnsupervised Learning
CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality
More informationUNSUPERVISED LEARNING
UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training
More informationGaussian-Bernoulli Deep Boltzmann Machine
Gaussian-Bernoulli Deep Boltzmann Machine KyungHyun Cho, Tapani Raiko and Alexander Ilin Aalto University School of Science Department of Information and Computer Science Espoo, Finland {kyunghyun.cho,tapani.raiko,alexander.ilin}@aalto.fi
More informationApproximation properties of DBNs with binary hidden units and real-valued visible units
Approximation properties of DBNs with binary hidden units and real-valued visible units Oswin Krause Oswin.Krause@diku.dk Department of Computer Science, University of Copenhagen, 100 Copenhagen, Denmark
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationDeep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang
Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang Autoencoder Autoencoder Neural networks trained to attempt to copy its input to its output Contain
More informationComputational statistics
Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated
More informationFirst Hitting Time Analysis of the Independence Metropolis Sampler
(To appear in the Journal for Theoretical Probability) First Hitting Time Analysis of the Independence Metropolis Sampler Romeo Maciuca and Song-Chun Zhu In this paper, we study a special case of the Metropolis
More informationInductive Principles for Restricted Boltzmann Machine Learning
Inductive Principles for Restricted Boltzmann Machine Learning Benjamin Marlin Department of Computer Science University of British Columbia Joint work with Kevin Swersky, Bo Chen and Nando de Freitas
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More information18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)
10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More information