A Bound for the Convergence Rate of Parallel Tempering for Sampling Restricted Boltzmann Machines

Asja Fischer a,b, Christian Igel b

a Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
b Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark

Abstract

Sampling from restricted Boltzmann machines (RBMs) is done by Markov chain Monte Carlo (MCMC) methods. The faster the convergence of the Markov chain, the more efficiently can high-quality samples be obtained. This is also important for robust training of RBMs, which usually relies on sampling. Parallel tempering (PT), an MCMC method that maintains several replicas of the original chain at higher temperatures, has been successfully applied for RBM training. We present the first analysis of the convergence rate of PT for sampling from binary RBMs. The resulting bound on the rate of convergence of the PT Markov chain shows an exponential dependency on the size of one layer and the absolute values of the RBM parameters. It is minimized by a uniform spacing of the inverse temperatures, which is often used in practice. As in the derivation of bounds on the approximation error for contrastive divergence learning, our bound on the mixing time implies an upper bound on the error of the gradient approximation when the method is used for RBM training.

Keywords: restricted Boltzmann machines, parallel tempering, mixing time

1. Introduction

Restricted Boltzmann machines (RBMs) are probabilistic graphical models corresponding to stochastic neural networks [1, 2] (see [3] for a recent review). They are applied in many machine learning tasks, notably they serve as building blocks of deep belief networks [4].

Markov chain Monte Carlo (MCMC) methods are used to sample from RBMs, and chains that quickly converge to their stationary distribution are desirable to efficiently get high-quality samples. Adaptation of the RBM model parameters typically corresponds to gradient-based likelihood maximization given training data. As computing the exact gradient is usually computationally not tractable, sampling-based methods are employed to approximate the likelihood gradient. It has been shown that inaccurate approximations can deteriorate the learning process (e.g., [5]), and for the most popular learning scheme, contrastive divergence learning (CD, [2]), the approximation quality has been analyzed [6, 7]. The quality of the approximation depends, among other things, on how quickly the Markov chain approaches the stationary distribution, that is, on its mixing rate. To improve RBM learning, parallel tempering (PT, [8]) has successfully been used as a sampling method in RBM training [9, 10, 11, 13]. Parallel tempering introduces supplementary Gibbs chains that sample from smoothed replicas of the original distribution with the goal of improving the mixing rate. However, so far there exist no published attempts to analyze the mixing rate of PT applied to RBMs. Based on the work by Woodard et al. [12], we provide the first such analysis. After introducing the basic concepts, section 3 states our main result. Section 4 summarizes general theorems required for our proof in section 5, which is followed by a discussion and our conclusions.

2. Background

In the following, we will give a brief introduction to RBMs and the relation between the mixing rate and the spectral gap of a Markov chain. Afterwards we will describe the parallel tempering algorithm and its application to sampling from RBMs.

2.1. Restricted Boltzmann machines

Restricted Boltzmann machines (RBMs) are probabilistic undirected graphical models (Markov random fields). Their structure is a bipartite graph connecting a set of $m$ visible random variables $V = (V_1, V_2, \dots, V_m)$ modeling observations to $n$ hidden (latent) random variables $H = (H_1, H_2, \dots, H_n)$ capturing dependencies between the visible variables. In binary RBMs the state space of one single variable is given by $\{0, 1\}$ and accordingly $(V, H) \in \{0, 1\}^{m+n}$.

The joint probability distribution of $(V, H)$ is given by the Gibbs distribution

$$p(v, h) = \frac{\exp(-E(v, h))}{Z}$$

with energy

$$E(v, h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} h_i w_{ij} v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i \qquad (1)$$

and real-valued connection weights $w_{ij}$ and bias parameters $b_j$ and $c_i$ for $i \in \{1,\dots,n\}$ and $j \in \{1,\dots,m\}$. The normalization constant $Z$, also called the partition function, is given by $Z = \sum_{v,h} \exp(-E(v, h))$.

Training an RBM means adapting its parameters such that the distribution of $V$ models a distribution underlying some observed data. In practice, this training corresponds to performing stochastic gradient ascent on the log-likelihood of the weight and bias parameters given sample (training) data. The gradient of the log-likelihood given a single training sample $v_{\mathrm{train}}$ is given by

$$\frac{\partial \ln p(v_{\mathrm{train}})}{\partial \theta} = -\sum_h p(h \mid v_{\mathrm{train}}) \frac{\partial E(v_{\mathrm{train}}, h)}{\partial \theta} + \sum_{v,h} p(v, h) \frac{\partial E(v, h)}{\partial \theta}, \qquad (2)$$

where $\theta$ is the vector collecting all parameters. Since the expectation under the model distribution in the second term on the right-hand side cannot be computed efficiently (it is exponential in $\min(n, m)$), it is approximated by MCMC methods in RBM training algorithms. Typically, the expected value under the model distribution is approximated by $\sum_h p(h \mid v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}$ given a sample $v^{(k)}$ obtained by running a Markov chain for $k$ steps. Alternatively, we can consider $\sum_v p(v \mid h^{(k)}) \frac{\partial E(v, h^{(k)})}{\partial \theta}$ given a sample $h^{(k)}$ to save computation time if $m < n$.
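To make equations (1) and (2) concrete, here is a minimal NumPy sketch (our illustration, not code from the paper; all helper names are ours) that evaluates the energy and the tractable first term of the gradient for a small binary RBM.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                               # hidden and visible units
W = rng.normal(scale=0.1, size=(n, m))    # weights w_ij
b = rng.normal(scale=0.1, size=m)         # visible biases b_j
c = rng.normal(scale=0.1, size=n)         # hidden biases c_i

def energy(v, h):
    """E(v, h) from equation (1)."""
    return -h @ W @ v - b @ v - c @ h

def p_h_given_v(v):
    """Conditional probabilities p(H_i = 1 | v); factorizes over i."""
    return 1.0 / (1.0 + np.exp(-(W @ v + c)))

def grad_W_term(v):
    """First term of (2) for theta = w_ij: sum_h p(h|v) h_i v_j."""
    return np.outer(p_h_given_v(v), v)

v_train = rng.integers(0, 2, size=m).astype(float)
print(energy(v_train, np.ones(n)), grad_W_term(v_train))
```

The conditional distribution of the hidden layer factorizes over the units, which is what makes the first term of (2) cheap, while the second term requires MCMC.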

2.2. Mixing rates and the spectral gap

A homogeneous Markov chain on a discrete state space $\Omega$ can be described by a transition matrix $P = (p_{x,y})_{x,y\in\Omega}$, where $p_{x,y}$ is the probability to move from $x$ to $y$ in one step of the Markov chain. We also refer to this probability as $P(x, y)$, and accordingly $P^k(x, y)$ gives the probability to move from $x$ to $y$ in $k$ steps of the chain.

Markov chain Monte Carlo methods make use of the fact that an ergodic Markov chain on $\Omega$ with transition matrix $P$ and equilibrium distribution $\pi$ satisfies $P^k(x, y) \to \pi(y)$ as $k \to \infty$ for all $x, y \in \Omega$. It is important to study the convergence rate of the Markov chain in order to find out how large $k$ has to be to ensure that $P^k(x, y)$ is suitably close to $\pi(y)$. One way to measure the closeness to stationarity is the total variation distance

$$\|P^k(x, \cdot) - \pi\| = \frac12 \sum_{y\in\Omega} |P^k(x, y) - \pi(y)|$$

for an arbitrary starting state $x$. Reversibility of the chain implies that the eigenvalues of $P$ are real-valued, and we sort them by value $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_r \ge -1$. The value $\mathrm{Gap}(P) = 1 - \lambda_2$ is called the spectral gap of $P$, and the eigenvalue with the second largest absolute value, $\mathrm{SLEM} = \max\{\lambda_2, |\lambda_r|\}$, is often referred to as the second largest eigenvalue modulus (SLEM). Many bounds on convergence rates are based on the SLEM. Diaconis and Saloff-Coste [13], for example, prove

$$\|P^k(x, \cdot) - \pi\| \le \frac{1}{2\sqrt{\pi(x)}}\,\mathrm{SLEM}^k. \qquad (3)$$

If $P$ is positive definite, all eigenvalues are non-negative. In this case $\mathrm{SLEM} = \lambda_2$ and we can replace $\mathrm{SLEM}$ in (3) (and similar bounds) by $1 - \mathrm{Gap}(P)$. We can make an arbitrary transition matrix $Q$ positive definite by skipping the move each time with probability $\frac12$, that is, by considering the transition matrix $P = \frac12 I + \frac12 Q$. This makes $P$ in the long run two times slower than $Q$.

A typical application of MCMC methods (e.g., in the training of RBMs) is to approximate the expected value of a given function under the stationary distribution based on samples obtained after running the Markov chain for $k$ steps. A bound on the chain's convergence rate directly leads to a bound on the bias of this approximation (see Appendix A for a proof).
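As a numerical illustration of (3) (our sketch; the three-state chain is made up for the example), one can compute the spectral gap of a lazy chain and check the total variation bound:

```python
import numpy as np

# A reversible transition matrix Q on three states and its lazy version P.
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
P = 0.5 * np.eye(3) + 0.5 * Q           # positive definite, SLEM = lambda_2

pi = np.full(3, 1.0 / 3.0)              # uniform stationary distribution
lam = np.sort(np.linalg.eigvalsh(P))[::-1]
gap = 1.0 - lam[1]                      # Gap(P) = 1 - lambda_2

for k in (1, 5, 10):
    tv = 0.5 * np.abs(np.linalg.matrix_power(P, k)[0] - pi).sum()
    bound = (1.0 - gap) ** k / (2.0 * np.sqrt(pi[0]))
    print(k, tv, bound)                 # tv <= bound must hold
```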

2.3. Parallel tempering

Consider a multi-modal target density $\pi$ on a state space $\Omega$ from which one would like to sample via a Markov chain. The transition steps of the Metropolis-Hastings algorithm move only locally in $\Omega$, so that the resulting Markov chain may move between the modes of $\pi$ only infrequently. The PT algorithm [8] tries to overcome this by introducing supplementary Markov chains, which sample from flattened or smoothed versions of $\pi$. These (and the original density) are referred to as tempered densities $\pi_t$, $t = 0,\dots,N$, fulfilling $\pi_t(z) \propto \pi(z)^{\beta_t}$ for $z \in \Omega$, with inverse temperatures $\beta_t$. The inverse temperatures satisfy $0 \le \beta_0 < \beta_1 < \dots < \beta_N = 1$. Parallel tempering now constructs a Markov chain on the joint state space $\Omega_T = \Omega^{N+1}$ of the tempered distributions. The chain takes values $x = (x_{[0]},\dots,x_{[N]}) \in \Omega_T$ and converges to the stationary distribution

$$\pi_T(x) = \prod_{t=0}^{N} \pi_t(x_{[t]}).$$

The PT transition matrix consists of two components, one for updating the states of the individual tempered chains and one allowing swaps between states of adjacent tempered chains. We will denote the corresponding transition matrices by $T$ and $Q$, respectively. For the considerations in this paper, we add a probability of $\frac12$ of staying in the current state each time $T$ or $Q$ is applied to guarantee positive definiteness and thus positive eigenvalues. We define the PT matrix as $P = QTQ$. A transition matrix constructed like this assures reversibility (unlike $TQ$) and is used in the analysis by Madras and Zheng [14] and Woodard et al. [12]. Here the matrices $T$ and $Q$ are defined as follows:

$T$ chooses $t \in \{0,\dots,N\}$ uniformly and updates the state of chain $t$ based on some transition matrix $T_t$, which is reversible with respect to $\pi_t$. This can, for example, be the Metropolis-Hastings or a Gibbs matrix [15]. Thus, the probability of moving from $x = (x_{[0]},\dots,x_{[N]})$ to $y = (y_{[0]},\dots,y_{[N]})$ under $T$ is given by

$$T(x, y) = \frac{1}{2(N+1)} \sum_{t=0}^{N} T_t(x_{[t]}, y_{[t]})\, I_{\{x_{[-t]}\}}(y_{[-t]}) + \frac12 I_{\{x\}}(y),$$

where $x_{[-t]} = (x_{[0]},\dots,x_{[t-1]}, x_{[t+1]},\dots,x_{[N]}) \in \Omega^N$ collects all components of $x$ except the $t$-th one and, for any set $B$, $I_B(y) = 1$ if $y \in B$ and $I_B(y) = 0$ otherwise. The term $\frac12 I_{\{x\}}(y)$ ensures non-negative eigenvalues as discussed above.

The swapping matrix $Q$ proposes to swap the states $x_{[t]}$ and $x_{[t+1]}$ by sampling $t$ uniformly from $\{0,\dots,N-1\}$. The proposed swap is accepted with the Metropolis probability

$$a(x, t) = \min\left\{1,\; \frac{\pi_t(x_{[t+1]})\,\pi_{t+1}(x_{[t]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\}, \qquad (4)$$

which guarantees detailed balance. So the transition probability from $x = (x_{[0]},\dots,x_{[N]})$ to $y = (x_{[0]},\dots,x_{[t+1]}, x_{[t]},\dots,x_{[N]})$ is given by

$$Q(x, y) = \frac{1}{2N}\, a(x, t),$$

the probability to stay in $x$ is

$$Q(x, x) = 1 - \frac{1}{2N} \sum_{t=0}^{N-1} a(x, t),$$

and the transition probability between all other states is 0. By construction both $T$ and $Q$ are reversible with respect to $\pi_T$ and strictly positive definite due to their holding probability. It follows from the definition of $P$ that both properties also hold for $P$. Thus, $P$ has only positive real-valued eigenvalues and the convergence of the PT chain to the stationary distribution $\pi_T$ can be bounded in terms of (3) by replacing SLEM by $1 - \delta$, where $\delta$ is a lower bound for $\mathrm{Gap}(P)$.

2.4. Training RBMs with PT

The PT algorithm is often the sampling procedure of choice for RBM learning algorithms to approximate the gradient of the log-likelihood given in equation (2) [10, 11]. In this setting, the tempered distributions are defined as

$$\pi_t(v, h) = \frac{\exp(-E(v, h)\beta_t)}{Z_t} \qquad (5)$$

with $Z_t = \sum_{v,h} \exp(-E(v, h)\beta_t)$, and $\beta_0$ is typically set to 0, which we will also assume in our analysis. The transition matrices $T_t$ for the single tempered chains are set to a randomized version of blockwise Gibbs sampling: we choose to update either $V$ or $H$ with equal probability and then update all neurons in this layer simultaneously. The random choice of the layer leads to reversibility of $T_t$. In this paper, we want to find a lower bound for the spectral gap of PT sampling as defined above applied to RBMs.

In the PT algorithm most often used for RBM training in practice, the transition operator differs from the one analyzed in this paper. It consists of an update step followed by a swapping step. During the update step, one step of blockwise Gibbs sampling is performed in all tempered chains in parallel, where one step of blockwise Gibbs sampling corresponds to sampling a new state $h'$ of $H$ based on the conditional probability $p(h' \mid v)$ given the current state of the visible variables $v$, followed by sampling a new state $v'$ of $V$ based on $p(v' \mid h')$. During the swapping step, the states at all temperatures may be swapped based on the Metropolis probability $a(x, t)$ as defined in (4), with $x = (v, h)$ (see the sketch below).
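The following Python sketch (ours, assuming the tempered distributions (5); an illustration of the practical variant just described, not the paper's code) performs one PT iteration for a small binary RBM: a blockwise Gibbs update in every tempered chain followed by adjacent swaps with the Metropolis probability (4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 4, 5                       # hidden, visible, N+1 tempered chains
W = rng.normal(scale=0.5, size=(n, m))
b = rng.normal(scale=0.5, size=m)
c = rng.normal(scale=0.5, size=n)
betas = np.linspace(0.0, 1.0, N + 1)    # uniform inverse temperature spacing

def energy(v, h):
    return -h @ W @ v - b @ v - c @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, h, beta):
    """One blockwise Gibbs step under the tempered distribution pi_beta."""
    h = (rng.random(n) < sigmoid(beta * (W @ v + c))).astype(float)
    v = (rng.random(m) < sigmoid(beta * (h @ W + b))).astype(float)
    return v, h

# One PT iteration: Gibbs update in every chain, then adjacent swaps.
states = [(rng.integers(0, 2, m).astype(float),
           rng.integers(0, 2, n).astype(float)) for _ in range(N + 1)]
states = [gibbs_step(v, h, beta) for (v, h), beta in zip(states, betas)]
for t in range(N):                      # propose swapping chains t and t+1
    (v0, h0), (v1, h1) = states[t], states[t + 1]
    log_a = (betas[t + 1] - betas[t]) * (energy(v1, h1) - energy(v0, h0))
    if np.log(rng.random()) < min(0.0, log_a):
        states[t], states[t + 1] = states[t + 1], states[t]
```

Note that for the Gibbs distributions (5) the swap acceptance (4) reduces to $\min\{1, \exp((\beta_{t+1}-\beta_t)(E(x_{[t+1]}) - E(x_{[t]})))\}$, which the sketch uses in log space; for simplicity it sweeps the swaps sequentially rather than in even/odd substeps.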

The swapping is often organized in two substeps, where even temperatures are considered in the first and odd temperatures in the second substep. While this approach seems to work in practice, the corresponding transition matrix need not be reversible, which makes it difficult to obtain theoretical results on its mixing behavior; see our discussion in section 6.

3. Main Result

Let $\pi$ denote the Gibbs distribution of an RBM as defined in (1) and for $0 = \beta_0 < \beta_1 < \dots < \beta_N = 1$ let $\pi_t$ be the corresponding tempered Gibbs distribution with inverse temperature $\beta_t$ as given in (5). Then the stationary distribution of PT sampling for RBMs as defined above is given by $\pi_T((x_{[0]},\dots,x_{[N]})) = \prod_{t=0}^N \pi_t(x_{[t]})$, and for the spectral gap of the corresponding transition matrix $P$ it holds:

Theorem 1. For the PT transition matrix $P$ of an RBM with $m$ visible and $n$ hidden variables

$$\mathrm{Gap}(P) \ge \min\Bigg\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m} f_j^2,\;\; \frac{\exp\big(-2c - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{j=1}^{m} f_j^2,$$
$$\qquad\qquad \frac{\exp(-2b)}{2^{m+10}(N+1)^4}\prod_{i=1}^{n} g_i^2,\;\; \frac{\exp\big(-2b - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{i=1}^{n} g_i^2\Bigg\} \qquad (6)$$

with $\Delta = \sum_{i,j}|w_{ij}| + \sum_j |b_j| + \sum_i |c_i|$ being the sum over the absolute values of the parameters of the RBM, $b = \sum_j |b_j|$ and $c = \sum_i |c_i|$ being the sums over the absolute values of the bias parameters of the visible and hidden variables, respectively, and

$$f_j = \frac{\min_h p(v_j = 1 \mid h)}{\max_h p(v_j = 1 \mid h)} \quad\text{and}\quad g_i = \frac{\min_v p(h_i = 1 \mid v)}{\max_v p(h_i = 1 \mid v)}$$

being the fractions of the minimal and maximal activation probabilities of the $j$-th visible and the $i$-th hidden neuron. (The sketch below evaluates these quantities for a toy model.)
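For illustration only (our sketch, not part of the paper), the quantities entering the bound (6) can be computed directly for a toy RBM; $f_j$ and $g_i$ are evaluated with the closed-form expressions derived in section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 2, 3, 4
W = rng.normal(scale=0.2, size=(n, m))
bias_v = rng.normal(scale=0.2, size=m)   # b_j
bias_h = rng.normal(scale=0.2, size=n)   # c_i
betas = np.linspace(0.0, 1.0, N + 1)

delta = np.abs(W).sum() + np.abs(bias_v).sum() + np.abs(bias_h).sum()
b = np.abs(bias_v).sum()
c = np.abs(bias_h).sum()

# f_j and g_i via the closed form of section 5: the extreme conditional
# activations are attained by switching all negative/positive weights on.
f = (1 + np.exp(bias_v + np.minimum(W, 0).sum(0))) / \
    (1 + np.exp(bias_v + np.maximum(W, 0).sum(0)))
g = (1 + np.exp(bias_h + np.minimum(W, 0).sum(1))) / \
    (1 + np.exp(bias_h + np.maximum(W, 0).sum(1)))

gap_term = 2 * np.max(np.diff(betas)) * delta
bound = min(
    np.exp(-2 * c) / 2 ** (n + 10) / (N + 1) ** 4 * np.prod(f ** 2),
    np.exp(-2 * c - gap_term) / 2 ** 8 / (N + 1) ** 4 * np.prod(f ** 2),
    np.exp(-2 * b) / 2 ** (m + 10) / (N + 1) ** 4 * np.prod(g ** 2),
    np.exp(-2 * b - gap_term) / 2 ** 8 / (N + 1) ** 4 * np.prod(g ** 2),
)
print(delta, bound)
```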

The proof is given in section 5. By combining this result with equation (3), we arrive at an upper bound on the rate of convergence of the PT chain.

Corollary 1. Let $P$ be the PT transition matrix for an RBM with $m$ visible and $n$ hidden variables and let $\pi_T$ be the joint distribution over all tempered chains. Then for any starting point of the Markov chain $x = (x_{[0]},\dots,x_{[N]})$, with $x_{[i]} = (v_{[i]}, h_{[i]})$, the distance in variation to $\pi_T$ is bounded by

$$\|P^k(x, \cdot) - \pi_T\| \le \frac{1}{2\sqrt{\pi_T(x)}}\,(1-\varepsilon)^k \qquad (7)$$

with

$$\varepsilon = \min\Bigg\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m} f_j^2,\;\; \frac{\exp\big(-2c - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{j=1}^{m} f_j^2,$$
$$\qquad\qquad \frac{\exp(-2b)}{2^{m+10}(N+1)^4}\prod_{i=1}^{n} g_i^2,\;\; \frac{\exp\big(-2b - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{i=1}^{n} g_i^2\Bigg\}$$

and $\Delta$, $b$, $c$, $f_j$, and $g_i$ as defined above.

Theorem 1 bounds the gap of the PT product chain. As the original chain (the chain with inverse temperature $\beta_N = 1$) never converges slower than this product chain (see Appendix B for a proof), Corollary 1 also leads to an upper bound for the convergence rate of the chain of interest. Bounding the product chain leads to the dependency on the number $N$ of supplementary tempered chains. However, to our knowledge all approaches analyzing PT suffer from this drawback (e.g., [12, 14, 16]).

Now, we can make use of the fact that a bound on the convergence rate implies a bound on the bias of an MCMC estimate and obtain a bound on the bias of the PT-based estimator when used in RBM training. Recall that for training RBMs the log-likelihood gradient given in equation (2) is approximated by replacing the expected value under the model distribution in the second term, $\sum_{v,h} p(v,h)\frac{\partial E(v,h)}{\partial\theta}$, by $\sum_h p(h \mid v^{(k)})\frac{\partial E(v^{(k)},h)}{\partial\theta}$ for a sample $v^{(k)}$ obtained by running the PT chain for $k$ steps. The absolute value of $\sum_h p(h \mid v^{(k)})\frac{\partial E(v^{(k)},h)}{\partial\theta}$ is bounded by 1 and thus, using the result in Appendix A with $\max_x |t(x)| = 1$, we get the following corollary:

9 Corollary. Given an RBM with m visible and n hidden variables and a T chain starting from x =(x [0],...,x [N] ), with x [i] =(v [i], h [i] ). Let T(x (k) ) be the approximation of the log-likelihood derivative w.r.t some RBM parameter obtained by estimating the second term in equation () based on a T-sample x (k) =(x (k) [0],...,x(k) [N]), with x(k) [i] =(v (k) [i], h(k) [i] ), gained by running the chain for k steps. That is, for a training sample v train T(x (k) )= (h v train train, h) + )@E(v(k) (h v (k) [0], h h (8) Then we can bound the absolute value of the bias of T(x (k) ) ln (v train x (k) k (x, x (k) ) T (x (k) ) apple where is defined as in corollary. p T (x) ( )k, (9) Similar considerations lead to an alternative proof of the approximation error of contrastive divergence learning [7] by combining the result in Appendix A with standard bounds on the periodic Gibbs sampler (e.g. [0], page 89). 4. Bounding the Spectral Gap of T This section summarizes definitions and results required for the proof of theorem. We state important results for bounding the spectral gap of general Markov chains and recall the link between the SLEM and Dobrushin s coe cient. 4.. Definitions For any transition matrix reversible with respect to a distribution and any subset A of the state space of,therestriction of to A is defined as: A (x,b)= (x,b)+i B (x) (x,a c ) (0) for x A, B A. InthiswaytransitionsthatwouldleaveA are prohibited and the probabilities to stay in A are increased correspondingly. 9

Given a partition $\mathcal{A} = \{A_j : j = 1,\dots,J\}$ of a finite $\Omega$ such that $\pi(A_j) = \sum_{x\in A_j}\pi(x) > 0$ for all $j$, the projection matrix of $P$ with respect to $\mathcal{A}$ is given by the transition operator $\bar P$ with

$$\bar P(i, j) = \frac{1}{\pi(A_i)} \sum_{x\in A_i}\sum_{y\in A_j} \pi(x)\,P(x, y) \qquad (11)$$

for $i, j \in \{1,\dots,J\}$.

4.2. General results

Given a partition $\mathcal{A}$ of the state space as defined above, the following theorem allows to lower bound the spectral gap of the PT transition matrix in terms of the spectral gap of the projection matrix and the minimum over the spectral gaps of the restrictions of $P$ to the subsets $A_j$.

Theorem 2. Let $P$ be a transition matrix reversible with respect to a distribution $\pi$ on a state space $\Omega$. Let $\{A_j : j = 1,\dots,J\}$ be any partition of $\Omega$ such that $\pi(A_j) > 0$ for all $j$. Define $P|_{A_j}$ as in (10) and $\bar P$ as in (11). If $P$ is nonnegative definite, it holds

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_{j=1,\dots,J}\mathrm{Gap}(P|_{A_j}).$$

As Woodard et al. [12] show, this theorem can be proven based on the results from Caracciolo et al. [17] (which were first published in [18]) and the results from Madras and Zheng [14].

Often, one wishes to bound the spectral gap of one chain based on the spectral gap of another chain, which, for example, is easier to estimate. In this case the following theorem can be applied. Combining the theorem for the comparison of the Dirichlet forms of two Markov chains given by Diaconis and Saloff-Coste [19] with Rayleigh's theorem (e.g., [20]) we get:

Theorem 3. Let $P$ and $Q$ be transition matrices on a finite state space $\Omega$, reversible with respect to densities $\pi_P$ and $\pi_Q$, respectively. Denote by $E_P = \{(x,y) : \pi_P(x)P(x,y) > 0\}$ and $E_Q = \{(x,y) : \pi_Q(x)Q(x,y) > 0\}$ the edge sets of the corresponding transition graphs. For each pair $x \ne y$ such that $(x,y) \in E_Q$ fix a path $\gamma_{x,y} = (x = x_0, x_1,\dots,x_k = y)$ of length $|\gamma_{x,y}| = k$ such that $(x_i, x_{i+1}) \in E_P$ for $i \in \{0,\dots,k-1\}$, and define

$$c = \max_{(z,w)\in E_P} \frac{1}{\pi_P(z)\,P(z,w)} \sum_{\gamma_{x,y}\ni(z,w)} |\gamma_{x,y}|\,\pi_Q(x)\,Q(x,y).$$

Then it holds $\mathrm{Gap}(Q) \le c\,\mathrm{Gap}(P)$.

If both chains are reversible with respect to the same distribution, the following theorem proven in [21] can be applied:

Theorem 4. Let $P$ and $Q$ be transition matrices on a state space $\Omega$ reversible with respect to $\pi$. If $Q(x, A\setminus\{x\}) \le P(x, A\setminus\{x\})$ for every $x \in \Omega$ and every $A \subseteq \Omega$, then $\mathrm{Gap}(Q) \le \mathrm{Gap}(P)$.

Recall that the PT transition matrix $T$ combines $N+1$ transition operators $T_0,\dots,T_N$ by taking values $x = (x_{[0]},\dots,x_{[N]})$ in the joint state space $\Omega_T$ and performing a transition by randomly picking $t \in \{0,\dots,N\}$ and sampling a new value $x_{[t]}$ for the $t$-th chain from $T_t$. Such a Markov chain is called a product chain. The following theorem deals with product chains and gives the dependency between the spectral gap of the product chain and the spectral gaps of all single chains it subsumes ([12]):

Theorem 5. For any natural number $N$ and $t = 0,\dots,N$, let $T_t$ be a $\pi_t$-reversible transition matrix on a state space $\Omega_t$. Let $T$ be the transition matrix on $\Omega = \prod_t \Omega_t$ given by

$$T(x, y) = \sum_{t=0}^{N} b_t\,T_t(x_{[t]}, y_{[t]})\,I_{\{x_{[-t]}\}}(y_{[-t]}), \qquad x, y \in \Omega,$$

for some set of $b_t > 0$ such that $\sum_t b_t = 1$, and where $x_{[t]}$ denotes the $t$-th component of $x$ (corresponding to the state of the $t$-th chain) and $x_{[-t]}$ all components except the $t$-th one. A Markov chain with transition matrix $T$ is called a product chain. It is reversible with respect to $\pi(x) = \prod_t \pi_t(x_{[t]})$ and

$$\mathrm{Gap}(T) = \min_{t=0,\dots,N} b_t\,\mathrm{Gap}(T_t).$$

The next theorem is a well-known result that gives an upper bound on the SLEM (e.g., see [20]).

Theorem 6. The second largest eigenvalue modulus SLEM of a transition matrix $P = (p_{x,y})_{x,y\in\Omega}$ can be bounded from above by Dobrushin's coefficient:

$$\mathrm{SLEM} \le D(P) = \frac12 \max_{x,y} \sum_z |p_{x,z} - p_{y,z}| = 1 - \min_{x,y} \sum_z \min\{p_{x,z}, p_{y,z}\}.$$

5. Proof of the main result

For proving the main result we make use of an approach inspired by Woodard et al. [12]. The basic idea behind this approach is to partition the state space of the PT chain into subsets, such that the Markov chain mixes well inside the single subsets at all temperatures and between the subsets at the highest temperature. To begin with, a suitable partitioning of the RBM state space is needed. For a binary RBM with $m$ visible and $n$ hidden neurons the state space is $\Omega = \{0,1\}^{n+m}$. Let $\mathcal{A} = \{A_j : j = 1,\dots,2^n\}$ be the partition of $\Omega$ where each subset $A_j$ contains all states having the same state of the hidden neurons, that is, $A_j = \{(v, h) \mid h = h_j\}$ if $h_j$ denotes the $j$-th state of the hidden random variables. Thus, we get $2^n$ subsets $A_1,\dots,A_{2^n}$ and each $A_j$ contains $2^m$ elements (the sketch below enumerates this partition for a tiny RBM). Using this partition, we can now prove Theorem 1 based on the results given in the previous section.

Our proof can be divided into five steps, where steps 1 and 2 are taken from Woodard et al. [12]. The basic thoughts behind the steps of the proof can be outlined as follows. The gap of the PT transition matrix $P$ depends on (a) how well the single tempered chains mix inside the single subsets $A_1,\dots,A_{2^n}$, (b) how well the chain at the highest temperature (i.e., the chain with inverse temperature $\beta_0$) mixes between the subsets, and (c) the mixing properties between the chains at the different temperatures. In step 1, $\mathrm{Gap}(P)$ is bounded in terms of the gaps of two transition matrices, one depending on (a) and the other depending on (b) and (c). The first one is further bounded in steps 2 and 3 and the second one in step 4. Step 5 puts everything together.
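As a concrete illustration (our sketch, not part of the proof), the partition $\mathcal{A}$ for a tiny binary RBM can be enumerated directly; each block $A_j$ collects the $2^m$ states that share the hidden configuration $h_j$:

```python
from itertools import product

n, m = 2, 3                     # hidden and visible units
partition = {}
for j, h in enumerate(product((0, 1), repeat=n)):
    # A_j = all (v, h_j): the 2^m states sharing hidden configuration h_j
    partition[j] = [(v, h) for v in product((0, 1), repeat=m)]

assert len(partition) == 2 ** n
assert all(len(block) == 2 ** m for block in partition.values())
print(partition[0][:2])         # first two states of A_0
```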

5.1. Step 1: Bounding Gap(P) in terms of Gap(P̄) and Gap(P|_{Ω_σ})

Consider the state $x = (x_{[0]},\dots,x_{[N]}) \in \Omega^{N+1}$ of the PT chain. Let the signature be the vector $s(x) = (\sigma_0,\dots,\sigma_N)$ with $\sigma_t = j$ if $x_{[t]} \in A_j$ for $t = 0,\dots,N$. Since the partition of $\Omega$ consists of $2^n$ subsets $A_j$ and we have $N+1$ temperatures, a signature lives in $\Sigma = \{1,\dots,2^n\}^{N+1}$. For a fixed $\sigma \in \Sigma$ let us now define $\Omega_\sigma = \{x \in \Omega^{N+1} : s(x) = \sigma\}$. Then all possible $\sigma$ induce a partition $\{\Omega_\sigma\}$ of the PT state space $\Omega_T = \Omega^{N+1}$. Let $P|_{\Omega_\sigma}$ now be the restriction of $P$ to $\Omega_\sigma$ as defined in equation (10), and let $\bar P$ denote the projection matrix of $P$ with respect to $\{\Omega_\sigma\}$ as defined in (11). Now we can apply Theorem 2 and get

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_{\sigma\in\Sigma}\mathrm{Gap}(P|_{\Omega_\sigma}). \qquad (12)$$

5.2. Step 2: Bounding Gap(P|_{Ω_σ}) in terms of min_{t,j} Gap(T_t|_{A_j})

Woodard et al. now proceed by bounding $\mathrm{Gap}(\bar P)$ and $\mathrm{Gap}(P|_{\Omega_\sigma})$ separately. For bounding $\mathrm{Gap}(P|_{\Omega_\sigma})$ they note that, since $P = QTQ$ and $Q$ has a holding probability of $\frac12$, we have $P(x, y) \ge \frac14 T(x, y)$ for all $x, y$. With Theorem 4 (applied to the lazy chain $\frac14 T|_{\Omega_\sigma} + \frac34 I$) it follows that $\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac14\mathrm{Gap}(T|_{\Omega_\sigma})$. As $T$ is a product chain, Theorem 5 gives

$$\mathrm{Gap}(T|_{\Omega_\sigma}) = \frac{1}{2(N+1)}\min_t \mathrm{Gap}(T_t|_{A_{\sigma_t}}) \ge \frac{1}{2(N+1)}\min_{t,j}\mathrm{Gap}(T_t|_{A_j}).$$

Thus, it follows

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{8(N+1)}\min_{t,j}\mathrm{Gap}(T_t|_{A_j}). \qquad (13)$$

5.3. Step 3: Bounding min_{t,j} Gap(T_t|_{A_j})

We will now derive a lower bound for $\min_{t,j}\mathrm{Gap}(T_t|_{A_j})$. The transition matrix $T_t$ analyzed here for (tempered) RBMs corresponds to randomized blockwise Gibbs sampling as described above. The transition probability from $(v, h)$ to $(v', h)$ is given by $T_t((v,h),(v',h)) = \frac12\pi_t(v' \mid h)$, and the probability to change the state of the hidden variables is accordingly $T_t((v,h),(v,h')) = \frac12\pi_t(h' \mid v)$. Based on (10), for the restriction of $T_t$ to $A_j$ it holds:

$$T_t|_{A_j}((v,h_j),(v',h_j)) = T_t((v,h_j),(v',h_j)) + I_{\{(v,h_j)\}}((v',h_j))\,T_t((v,h_j), A_j^c)$$

for $(v,h_j), (v',h_j) \in A_j$. Thus, for $v \ne v'$

$$T_t|_{A_j}((v,h_j),(v',h_j)) = \tfrac12\pi_t(v' \mid h_j),$$

and the probability to stay in the same state is

$$T_t|_{A_j}((v,h_j),(v,h_j)) = \tfrac12\pi_t(v \mid h_j) + \tfrac12\sum_{h\ne h_j}\pi_t(h \mid v).$$

Let us denote the SLEM of $T_t|_{A_j}$ by $\mathrm{SLEM}^{T_t|_{A_j}}$ and the second largest eigenvalue by $\lambda_2^{T_t|_{A_j}}$. Based on Theorem 6 it holds

$$\mathrm{Gap}(T_t|_{A_j}) = 1 - \lambda_2^{T_t|_{A_j}} \ge 1 - \mathrm{SLEM}^{T_t|_{A_j}} \ge 1 - D(T_t|_{A_j}). \qquad (14)$$

To upper bound the Dobrushin coefficient of $T_t|_{A_j}$, first note that for all $v, v'$:

$$\sum_{\hat v}\min\{T_t|_{A_j}((v,h_j),(\hat v,h_j)),\, T_t|_{A_j}((v',h_j),(\hat v,h_j))\} \ge \sum_{\hat v}\tfrac12\pi_t(\hat v \mid h_j) = \tfrac12,$$

since for $\hat v \notin \{v, v'\}$ both arguments of the minimum equal $\frac12\pi_t(\hat v \mid h_j)$, and for $\hat v \in \{v, v'\}$ the minimum is at least $\frac12\pi_t(\hat v \mid h_j)$ because the holding probability only adds to it. Now it is easy to see that $D(T_t|_{A_j}) \le 1 - \frac12 = \frac12$, and insertion into (14) gives $\mathrm{Gap}(T_t|_{A_j}) \ge 1 - D(T_t|_{A_j}) \ge \frac12$. Thus, we finally get by insertion into equation (13)

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{16(N+1)}. \qquad (15)$$

5.4. Step 4: Bounding Gap(P̄)

For bounding $\mathrm{Gap}(\bar P)$ first note that $\bar P$ is reversible with respect to the probability mass function

$$\bar\pi(\sigma) = \pi_T(\Omega_\sigma) = \prod_{t=0}^{N}\pi_t(A_{\sigma_t}), \qquad \forall\sigma\in\Sigma,$$

and that according to definition (11), for any $\sigma, \eta \in \Sigma$, the probability of moving from $\sigma$ to $\eta$ under $\bar P$ is given by

$$\bar P(\sigma, \eta) = \frac{1}{\bar\pi(\sigma)}\sum_{x\in\Omega_\sigma}\sum_{y\in\Omega_\eta}\pi_T(x)\,P(x, y). \qquad (16)$$

We proceed by bounding $\mathrm{Gap}(\bar P)$ based on Theorem 3 by comparing $\bar P$ to another $\bar\pi$-reversible transition matrix $\bar T$. The transition matrix $\bar T$ chooses $t$ uniformly from $\{0,\dots,N\}$ and then draws $\sigma_t$ with the probability $\pi_t(A_{\sigma_t})$. To ease the notation let us now denote $\sigma_{[i,j]} = (\sigma_0,\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$. Now we can define the transition matrix $\bar T$ by

$$\bar T(\sigma, \sigma_{[i,j]}) = \frac{1}{N+1}\,\pi_i(A_j)$$

and $\bar T(\sigma, \sigma') = 0$ if $\forall i,j: \sigma' \ne \sigma_{[i,j]}$. For the application of Theorem 3, for each edge $(\sigma, \sigma_{[i,j]})$ in the transition graph of $\bar T$ let us define a path $\gamma_{\sigma,\sigma_{[i,j]}}$ in the transition graph of $\bar P$ as follows:

Stage 1: change $\sigma_0$ to $j$;
Stage 2: swap $j$ up to level $i$;
Stage 3: swap the new $\sigma_{i-1}$ (formerly $\sigma_i$) down to level 0;
Stage 4: change the value at level 0 to $\sigma_0$ (from former $\sigma_i$).

We derive an upper bound of the constant $c$ of Theorem 3 by splitting it into three terms and bounding each term separately. Here $c$ is the maximum with respect to $\nu$ and $\nu'$ (with $\bar\pi(\nu)\bar P(\nu,\nu') > 0$) of

$$\underbrace{\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}|}_{\text{Term I}}\;\cdot\;\underbrace{\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}}_{\text{Term II}}\;\cdot\;\underbrace{\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')}}_{\text{Term III}}. \qquad (17)$$

The product of the three upper bounds on the three terms I, II, and III will be used as an upper bound on $c$.

Bounding Term I. First, we want to find an upper bound for

$$\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}|$$

for the above-defined paths and for any edge $(\nu,\nu')$ in the graph of $\bar P$. Let us start with finding an upper bound for the length $|\gamma_{\sigma,\sigma_{[i,j]}}|$ of a path. Since $i \in \{0,\dots,N\}$, stages 2 and 3 can each consist of at most $N$ swapping moves. Stages 1 and 4 each contain only one transition. Thus, $|\gamma_{\sigma,\sigma_{[i,j]}}| \le 2N + 2 = 2(N+1)$.

We will now upper bound the number of paths that go through any edge $(\nu,\nu')$. If the edge corresponds to stage 1 of the path, $\nu = \sigma$ and $\nu' = \sigma_{[0,j]}$, and since $i \in \{0,\dots,N\}$ there are not more than $N+1$ paths containing this edge. A similar argumentation holds for edges in stage 4. If the edge is in stage 2 of the path, $\nu = (\nu_0,\dots,\nu_{l-1}, j, \nu_{l+1},\dots,\nu_N)$ for some $l \in \{0,\dots,i-1\}$ and $\nu' = (\nu_0,\dots,\nu_{l-1}, \nu_{l+1}, j, \nu_{l+2},\dots,\nu_N)$. Here, $\sigma_0$ is unknown and has $2^n$ possible values. With $i \in \{0,\dots,N\}$ there are not more than $2^n(N+1)$ paths containing an edge in stage 2 of the path. Similarly, there are no more than $2^n(N+1)$ paths containing a certain edge in stage 3 of the path. Since the same edge can either be found in stage 1 or 4 of the path or in stage 2 or 3, each edge can lie in two stages, each corresponding to at most $2^n(N+1)$ paths. Therefore, the total number of paths containing a single edge is no more than $2\max\{N+1,\, 2^n(N+1)\} = 2^{n+1}(N+1)$, and thus

$$\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}| \le 2^{n+2}(N+1)^2. \qquad (18)$$

Bounding Term II. For bounding $\bar\pi(\sigma)/\bar\pi(\nu)$ first note that any state $\nu$ in stage 1 or 2 of the path from $\sigma$ to $\sigma_{[i,j]}$ as given above is of the form $\nu = (\sigma_1,\dots,\sigma_l, j, \sigma_{l+1},\dots,\sigma_N)$ for some $l \in \{0,\dots,i\}$. Therefore,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \left[\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right]\frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)}. \qquad (19)$$

For the Boltzmann distribution of RBMs we have

$$\frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)} = \frac{Z_l}{Z_0}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)}$$

and for the first factor on the right side of equation (19)

$$\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})} = \frac{Z_0}{Z_l}\prod_{k=1}^{l}\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})}.$$

Now consider that

$$\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})} \le \frac{\sum_v \exp(-\min_h E(v,h)\beta_k)}{\sum_v \exp(-\min_h E(v,h)\beta_{k-1})}, \qquad (20)$$

because we can write

$$\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})} = \frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k + \max_{v,h}E(v,h))}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1} + \max_{v,h}E(v,h))},$$

which makes all arguments of the exponential function in the terms of denominator and numerator nonnegative. The function

$$\frac{R_k + \exp(-x\beta_k + y)}{R_{k-1} + \exp(-x\beta_{k-1} + y)}$$

is monotonically decreasing in $x$ for $x \le y$ and $\beta_k \ge \beta_{k-1} \ge 0$, and for each value of $v$ we can write $x = E(v, h_{\sigma_k})$ and $y = \max_{v,h}E(v,h)$ and fix $R_k$ and $R_{k-1}$ to the remaining terms in numerator and denominator, respectively. Thus, the expression gets maximal if we replace $x$ by $\min_h E(v,h)$. This can be done for all values of $v$. So we can write:

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{Z_0}{Z_l}\left[\prod_{k=1}^{l}\frac{\sum_v \exp(-\min_h E(v,h)\beta_k)}{\sum_v \exp(-\min_h E(v,h)\beta_{k-1})}\right]\frac{Z_l}{Z_0}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)}$$
$$= \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_0)}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)} \stackrel{\beta_0=0}{=} \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-E(v,h_j)\beta_l)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Any state $\nu$ in stage 3 of the path is of the form $\nu = (\sigma_1,\dots,\sigma_l, \sigma_i, \sigma_{l+1},\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$ for some $l \in \{0,\dots,i-1\}$. Thus,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \left[\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right]\frac{\pi_0(A_{\sigma_0})\,\pi_i(A_{\sigma_i})}{\pi_l(A_{\sigma_i})\,\pi_i(A_j)}.$$

In an analogous way as above we get

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_0)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_0)}{\sum_v \exp(-\min_h E(v,h)\beta_i)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)}$$
$$= \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_i)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Any state $\nu$ in stage 4 is given by $\nu = (\sigma_i, \sigma_1,\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$. Therefore,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \frac{\pi_0(A_{\sigma_0})\,\pi_i(A_{\sigma_i})}{\pi_0(A_{\sigma_i})\,\pi_i(A_j)} = \frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_{\sigma_i})\beta_0)}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)}$$
$$\stackrel{\beta_0=0}{=} \frac{\sum_v \exp(-E(v,h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Thus, by now we have shown that for all states $\nu$ in the path we have

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}. \qquad (21)$$

Bounding Term III. Now we will bound the remaining term $\bar T(\sigma,\sigma_{[i,j]})/\bar P(\nu,\nu')$. We have

$$\bar T(\sigma,\sigma_{[i,j]}) = \frac{1}{N+1}\,\pi_i(A_j) \le \frac{1}{N+1}\cdot\frac{\sum_v \exp(-\min_h E(v,h))}{2^n\sum_v \exp(-\max_h E(v,h))}. \qquad (22)$$

For bounding $\bar P(\nu,\nu')$, we consider two cases. Any edge $(\nu,\nu')$ on the path $\gamma_{\sigma,\sigma_{[i,j]}}$ can be one of the following two types:

Case 1: It is an edge where we get $\nu'$ from $\nu = (\nu_0,\dots,\nu_N)$ by replacing $\nu_0$ by any new state $\nu_0'$ and, thus, $\nu' = \nu_{[0,\nu_0']} = (\nu_0', \nu_1,\dots,\nu_N)$.

Case 2: It is an edge which we obtain by swapping two elements $\nu_t$ and $\nu_{t+1}$ and, thus, $\nu' = (\nu_0,\dots,\nu_{t+1}, \nu_t,\dots,\nu_N)$.

Let us analyse these two cases separately:

Case 1: Based on (16) and

$$P(x, y) \ge Q(x, x)\,T(x, y)\,Q(x, x) \ge \tfrac14\,T(x, y),$$

we get

$$\bar P(\nu,\nu') \ge \frac{1}{4\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\sum_{y\in\Omega_{\nu'}}T(x, y).$$

Because we consider the first case, we can restrict the inner sum to $y$ that differ from $x$ only in the state of the lowest chain, since $T(x, y) = 0$ otherwise. That is, for $x = (x_{[0]}, x_{[1]},\dots,x_{[N]})$ we restrict the inner sum to all $y = (y_{[0]}, x_{[1]},\dots,x_{[N]})$ with $y_{[0]} \in A_{\nu_0'}$. In this case we have

$$T(x, y) = \frac{1}{2(N+1)}\,T_0(x_{[0]}, y_{[0]}).$$

Furthermore, for $x_{[0]} = (v_{[0]}, h_{[0]})$ there is only one $y_{[0]} \in A_{\nu_0'}$ that can be reached by one transition of $T_0$, namely $y_{[0]} = (v_{[0]}, h_{\nu_0'})$, with probability $T_0(x_{[0]}, y_{[0]}) = \frac{1}{2\cdot 2^n}$ for $\beta_0 = 0$ (probability $\frac12$ to sample a new state of the $n$ hidden variables, which are uniformly distributed). And thus

$$\bar P(\nu,\nu') \ge \frac{1}{4\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\,\frac{1}{2(N+1)}\cdot\frac{1}{2^{n+1}} = \frac{1}{16\cdot 2^n(N+1)},$$

where we use $\sum_{x\in\Omega_\nu}\pi_T(x) = \bar\pi(\nu)$. Thus, we get

$$\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le 16\cdot 2^n(N+1)\,\bar T(\sigma,\sigma_{[i,j]}).$$

Using (22) this is bounded from above by

$$\frac{2^4\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}. \qquad (23)$$

Case 2: For bounding $\bar P(\nu,\nu')$ in the second case, we can make use of $P(x, y) \ge \frac{1}{4N}a(x,t)$, because $P = QTQ$ and the swap can either occur under the first or the second $Q$ (each proposing it with probability $\frac{1}{2N}$) while the probability of holding under $T$ and the other $Q$ is $\frac14$. Thus, we get

$$\bar P(\nu,\nu') \ge \frac{1}{4N\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\,a(x,t)$$
$$= \frac{1}{4N\,\pi_t(A_{\nu_t})\,\pi_{t+1}(A_{\nu_{t+1}})}\sum_{x_{[t]}\in A_{\nu_t}}\sum_{x_{[t+1]}\in A_{\nu_{t+1}}}\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})\min\left\{1,\frac{\pi_{t+1}(x_{[t]})\,\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\},$$

where in the last step we made use of the facts that $\nu_{[i]} = \nu'_{[i]}$ for $i \in \{0,\dots,N\}\setminus\{t, t+1\}$ and that the probabilities to propose and to accept a certain swap according to $Q$ are $\frac{1}{2N}$ and $\min\big\{1, \frac{\pi_{t+1}(x_{[t]})\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\pi_{t+1}(x_{[t+1]})}\big\}$, respectively. We further have

$$\sum_{x_{[t]}\in A_{\nu_t}}\sum_{x_{[t+1]}\in A_{\nu_{t+1}}}\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})\min\left\{1,\frac{\pi_{t+1}(x_{[t]})\,\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\}$$
$$= \frac{1}{Z_t Z_{t+1}}\sum_v\sum_{\hat v}\exp(-E(v,h_{\nu_t})\beta_t)\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1})\min\big\{1,\exp\big((E(v,h_{\nu_t}) - E(\hat v,h_{\nu_{t+1}}))(\beta_{t+1}-\beta_t)\big)\big\}$$
$$\ge \frac{\exp\big(-(\max_{v,h}E(v,h)-\min_{v,h}E(v,h))(\beta_{t+1}-\beta_t)\big)}{Z_t Z_{t+1}}\sum_v \exp(-E(v,h_{\nu_t})\beta_t)\sum_{\hat v}\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1}) > 0,$$

and

$$\pi_t(A_{\nu_t})\,\pi_{t+1}(A_{\nu_{t+1}}) = \frac{1}{Z_t Z_{t+1}}\Big(\sum_v \exp(-E(v,h_{\nu_t})\beta_t)\Big)\Big(\sum_{\hat v}\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1})\Big),$$

and thus a bound of $\bar P(\nu,\nu')$ is given by

$$\bar P(\nu,\nu') \ge \frac{1}{4N}\exp\big(-\max_t(\beta_{t+1}-\beta_t)\,(\max_{v,h}E(v,h)-\min_{v,h}E(v,h))\big).$$

Using

$$\max_{v,h}E(v,h) - \min_{v,h}E(v,h) \le 2\Delta,$$

with $\Delta = \sum_{i,j}|w_{ij}| + \sum_j|b_j| + \sum_i|c_i|$ summing the absolute values of the parameters of the RBM, we get

$$\bar P(\nu,\nu') \ge \frac{\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)}{4N}. \qquad (24)$$

From (24) and the upper bound for $\bar T(\sigma,\sigma_{[i,j]})$ given in (22), it follows in the current case that

$$\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\sum_v \exp(-\min_h E(v,h))}{2^n\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\sum_v \exp(-\max_h E(v,h))}. \qquad (25)$$

Combining Terms I, II, III. By joining the results (23) and (25) for Term III with the upper bound for Term II given in (21), we get in Case 1

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^4\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2} \qquad (26)$$

and in Case 2

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{2^n\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\big(\sum_v \exp(-\max_h E(v,h))\big)^2}. \qquad (27)$$

If we combine the results (26) and (27) for the two cases, we get

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^2,\; \frac{\exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^n}\right\}. \qquad (28)$$

Putting this and (18) together, we arrive at an upper bound for the constant $c$ from Theorem 3:

$$c \le 2^{n+2}(N+1)^2\cdot\frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^2,\; \frac{\exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^n}\right\}$$
$$= 16(N+1)^2\,\frac{\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^{n+2},\; \exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}.$$

Thus, applying Theorem 3 and Theorem 5, from which it follows that $\mathrm{Gap}(\bar T) = \frac{1}{N+1}$ because we have $b_0 = b_1 = \dots = b_N = 1/(N+1)$ and all component chains of the product chain $\bar T$ have a spectral gap of 1, we get:

$$\mathrm{Gap}(\bar P) \ge \frac{1}{c}\,\mathrm{Gap}(\bar T) \ge \frac{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}{16(N+1)^3\big(\sum_v \exp(-\min_h E(v,h))\big)^2}\min\left\{2^{-(n+2)},\; \exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}. \qquad (29)$$

5.5. Step 5: Putting it all together

Using (29) and (15) in (12) leads to

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_\sigma\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}{2^8(N+1)^4\big(\sum_v \exp(-\min_h E(v,h))\big)^2}\min\left\{2^{-(n+2)},\;\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\},$$

which we further bound by making use of the factorisation of the marginal distribution of the hidden RBM variables. As can, for example, be seen from equation (10) in [3], the marginal distribution of the hidden units can be written as

$$p(h) = \frac{1}{Z}\sum_v \exp(-E(v,h)) = \frac{1}{Z}\prod_{i=1}^{n}\exp(c_i h_i)\prod_{j=1}^{m}\Big(1+\exp\Big(b_j+\sum_{i=1}^{n}w_{ij}h_i\Big)\Big).$$

This leads to

$$\frac{\sum_v \exp(-\max_h E(v,h))}{\sum_v \exp(-\min_h E(v,h))} \ge \frac{\prod_{i=1}^{n}\exp(\min_{h_i}c_i h_i)\prod_{j=1}^{m}\big(1+\exp(b_j+\sum_{i=1}^{n}\min_{h_i}w_{ij}h_i)\big)}{\prod_{i=1}^{n}\exp(\max_{h_i}c_i h_i)\prod_{j=1}^{m}\big(1+\exp(b_j+\sum_{i=1}^{n}\max_{h_i}w_{ij}h_i)\big)}$$
$$= \exp\Big(-\sum_i |c_i|\Big)\prod_{j=1}^{m}\frac{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)}{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)} = \exp(-c)\prod_{j=1}^{m}f_j,$$

with $c = \sum_i |c_i|$ and

$$f_j = \frac{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)}{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)} = \frac{\mathrm{sig}\big(-b_j-\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)}{\mathrm{sig}\big(-b_j-\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)} = \frac{\min_h p(v_j = 1\mid h)}{\max_h p(v_j = 1\mid h)},$$

where $\mathrm{sig}(x) = \frac{1}{1+\exp(-x)}$. Therefore, $f_j$ is just the fraction of the minimal and maximal activation of the $j$-th visible unit of the RBM.
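Before continuing, the middle equality above, relating the quotient of the two logistic terms to a quotient of sigmoids, can be checked numerically (our sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 4
W = rng.normal(size=(n, m))
b = rng.normal(size=m)

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

s_neg = np.minimum(W, 0).sum(0)   # sum of negative weights into each v_j
s_pos = np.maximum(W, 0).sum(0)   # sum of positive weights into each v_j

lhs = (1 + np.exp(b + s_neg)) / (1 + np.exp(b + s_pos))
rhs = sig(-b - s_pos) / sig(-b - s_neg)
assert np.allclose(lhs, rhs)      # (1+e^x)/(1+e^y) = sig(-y)/sig(-x)
```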

Thus, we get

$$\mathrm{Gap}(P) \ge \frac{\exp(-2c)}{2^8(N+1)^4}\prod_{j=1}^{m}f_j^2\,\min\left\{2^{-(n+2)},\;\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}$$
$$= \min\left\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m}f_j^2,\;\; \frac{\exp(-2c-2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^8(N+1)^4}\prod_{j=1}^{m}f_j^2\right\}.$$

It is possible to repeat the proof while changing the roles of hidden and visible variables, which finally leads to (6) and proves our main result.

6. Discussion

Theorem 1 and Corollary 1 are the first non-trivial results on the convergence rate of PT for sampling from binary RBMs. As shown in Corollary 2, they lead to an upper bound on the approximation error when applying the analyzed PT method to gradient estimation in RBM training. The upper bound depends exponentially on the size of one layer and the absolute values of the RBM parameters. The fewer the number of nodes and/or the smaller the magnitudes of the parameters, the faster the convergence. This intuitive result resembles the bounds on the approximation bias in contrastive divergence learning [7]. Actually, our current hypothesis is that RBM PT chains are in general not rapidly mixing, and that it is in general not possible to get rid of the exponential dependencies in the RBM sampling complexity. This is the case for some related models such as variants of the Potts model [16, 22].

Because our analysis considers the convergence to the stationary distribution of the product chain consisting of all replica chains, we get an undesired additional dependency on the number of replicas. However, such a dependency seems to be inevitable for this type of analysis, and we are not aware of any approach bounding the original chain.

The terms $f_j$, $j = 1,\dots,m$, and $g_i$, $i = 1,\dots,n$, are the fractions of the minimal and maximal probability to be activated of visible neuron $j$ and hidden neuron $i$ given states of the respective other layer. They are measures of how strongly neurons vary with changing input. The observed dependency is in accordance with the expectation that highly variable chains require more samples to reach the stationary distribution.

Furthermore, the minimal and maximal probabilities to be activated give information about how close the conditional distributions are to being uniform. Gibbs sampling relies on these conditional distributions, and it is optimal if all neurons are activated with probability 0.5 (it reaches the stationary distribution in only one sampling step in this case); it gets more and more deterministic the closer these probabilities get to zero or one, and thus mixing slows down, as our bound indicates.

When inspecting the role of the inverse temperatures in (6), we see that the term $\max_t(\beta_{t+1}-\beta_t)$ gets minimal if $\beta_0,\dots,\beta_N$ are spaced uniformly between 0 and 1. In this case $\max_t(\beta_{t+1}-\beta_t) = \frac1N$. This theoretical result is in accordance with the heuristic choice of a uniform inverse temperature spacing as used in many applications (e.g., see [9, 11]). So far, to our knowledge, the only other theoretically justified heuristic for spacing the inverse temperatures in the tempered transitions literature is to choose $\beta_0,\dots,\beta_N$ geometrically (i.e., $\beta_{i+1}/\beta_i = \mathrm{const}$) [23, 24]. Geometric temperature spacing has been shown to be optimal for sampling from Gaussian distributions by Neal [23], and this result has been extended to a more general class of distributions by Behrens et al. [24]. However, RBMs do not fall into this class, and a geometric spacing is sub-optimal for sampling RBMs [25]. Desjardins et al. suggest to use an adaptation scheme that dynamically adjusts the inverse temperatures during RBM training (starting from a uniform spacing). In their experiments, the higher inverse temperatures are adapted to values that are much closer together than a geometric spacing would suggest [25], which also supports a uniform rather than a geometric static inverse temperature spacing.
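To illustrate the spacing discussion (our sketch; the parameters are made up), the following compares $\max_t(\beta_{t+1}-\beta_t)$ for a uniform and a geometric spacing of the inverse temperatures:

```python
import numpy as np

N = 10
uniform = np.linspace(0.0, 1.0, N + 1)

# Geometric spacing beta_{i+1}/beta_i = const needs beta_0 > 0;
# we pick beta_1 = 0.01 and prepend beta_0 = 0 as in the RBM setting.
geometric = np.concatenate(([0.0], np.geomspace(0.01, 1.0, N)))

print(np.diff(uniform).max())    # 1/N = 0.1, the minimum possible value
print(np.diff(geometric).max())  # > 1/N, so the bound (6) is looser
```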

We regard the PT variant we analyzed as the typical algorithm considered in theoretical analyses of PT, for instance by Madras and Zheng [14] and Woodard et al. [12]. Still, it differs from the PT most often used for RBMs in practice. As mentioned in section 2.4, in this practical PT variant the corresponding transition operator is not reversible for several reasons. Firstly, it consists of an update step in which blockwise Gibbs sampling is performed in all tempered chains. This is not reversible due to the fixed order of first sampling $h^{k+1}$ based on $p(h \mid v^k)$, followed by sampling $v^{k+1}$ based on $p(v \mid h^{k+1})$, where the superscripts denote the sampling steps. Secondly, the swapping is usually organized in two substeps, where even temperatures are considered in the first and odd temperatures in the second substep. And, last but not least, we have $P = TQ$ instead of $P = QTQ$. Lacking reversibility prohibits our type of analysis, and we are not aware of other strategies not requiring the property. Of course, one can design PT operators that are closer to the one used for RBM sampling in practice but still reversible. For example, we can define $P = QTQ$, choose $T$ to perform an update of all visible or all hidden variables (each picked with probability 0.5) in all tempered chains simultaneously, and $Q$ to try to swap, with equal probability, either all even or all odd chains in parallel. We have adapted our analysis to this PT variant. However, the resulting bound was actually less tight (among other reasons because this $T$ is not defined as a product chain anymore, but is of the form $T = \prod_{t=0}^{N}T_t$, and thus Theorem 5 cannot be applied) and the proof did not provide additional insights.

7. Conclusion

Parallel tempering (PT) is state-of-the-art for sampling restricted Boltzmann machines (RBMs). We presented a first analysis of the convergence rate of the Markov chains of the PT algorithm as considered by Madras and Zheng [14] and Woodard et al. [12] for sampling RBMs by deriving an arguably loose, but non-trivial bound on the spectral gap. We find an exponential dependency on the maximum size of the two layers and the absolute values of the RBM parameters, in accordance with existing bounds on the approximation bias in contrastive divergence learning. We hypothesise that RBM PT chains are in general not rapidly mixing, and that one can in general not get rid of the exponential dependencies on the number of variables in the RBM sampling complexity. We regard proving conditions under which RBM PT chains are torpid mixing as an interesting question for future research.

Our lower bound on the spectral gap gets maximal if the inverse temperatures are spaced uniformly, which is a common choice in practice. We do not claim that this spacing is optimal; in particular, an adaptive spacing seems to be a good idea. Still, the result matches the practical experience that a uniform spacing (either fixed or as starting point for adaptation) is preferable to a geometric one when using PT for sampling RBMs.

The bound on the convergence rate naturally leads to a bound on the gradient approximation error of the method when used during RBM training, which resembles bounds on the approximation error of contrastive divergence learning [7].

References

[1] P. Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, in: D. E. Rumelhart, J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, MIT Press, 194-281, 1986.

[2] G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation 14 (2002) 1771-1800.

[3] A. Fischer, C. Igel, Training Restricted Boltzmann Machines: An Introduction, Pattern Recognition 47 (2014) 25-39.

[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 313 (5786) (2006) 504-507.

[5] A. Fischer, C. Igel, Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines, in: K. Diamantaras, W. Duch, L. S. Iliadis (Eds.), International Conference on Artificial Neural Networks (ICANN 2010), LNCS, Springer-Verlag, 208-217, 2010.

[6] Y. Bengio, O. Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation 21 (6) (2009) 1601-1621.

[7] A. Fischer, C. Igel, Bounding the Bias of Contrastive Divergence Learning, Neural Computation 23 (2011) 664-673.

[8] C. J. Geyer, Markov chain Monte Carlo maximum likelihood, in: E. M. Keramidas (Ed.), Proceedings of the 23rd Symposium on the Interface of Computing Science and Statistics, Interface Foundation of North America, 156-163, 1991.

[9] R. Salakhutdinov, Learning in Markov Random Fields using Tempered Transitions, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (NIPS 22), 1598-1606, 2009.

[10] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, O. Delalleau, Parallel Tempering for Training of Restricted Boltzmann Machines, Journal of Machine Learning Research Workshop and Conference Proceedings 9 (AISTATS 2010) (2010) 145-152.

[11] K. Cho, T. Raiko, A. Ilin, Parallel tempering is efficient for learning restricted Boltzmann machines, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2010), IEEE Press, 2010.

[12] D. B. Woodard, S. C. Schmidler, M. Huber, Conditions for Rapid Mixing of Parallel and Simulated Tempering on Multimodal Distributions, The Annals of Applied Probability 19 (2009) 617-640.

[13] P. Diaconis, L. Saloff-Coste, What do we know about the Metropolis algorithm?, Journal of Computer and System Sciences 57 (1998) 20-36.

[14] N. Madras, Z. Zheng, On the swapping algorithm, Random Structures & Algorithms 22 (2003) 66-97.

[15] K. Brügge, A. Fischer, C. Igel, The flip-the-state transition operator for Restricted Boltzmann Machines, Machine Learning 93 (2013) 53-69.

[16] N. Bhatnagar, D. Randall, Torpid Mixing of Simulated Tempering on the Potts Model, in: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004.

[17] S. Caracciolo, A. Pelissetto, A. D. Sokal, Two remarks on simulated tempering, unpublished manuscript.

[18] N. Madras, D. Randall, Markov chain decomposition for convergence rate analysis, The Annals of Applied Probability 12 (2002) 581-606.

[19] P. Diaconis, L. Saloff-Coste, Comparison theorems for reversible Markov chains, The Annals of Applied Probability 3 (1993) 696-730.

[20] P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Springer-Verlag, 1999.

[21] P. Diaconis, L. Saloff-Coste, Logarithmic Sobolev inequalities for finite Markov chains, The Annals of Applied Probability 6 (1996) 695-750.

[22] D. Woodard, S. Schmidler, M. Huber, Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering, Electronic Journal of Probability 14 (2009) 780-804.

[23] R. Neal, Sampling from multimodal distributions using tempered transitions, Statistics and Computing 6 (1996) 353-366.

[24] G. Behrens, N. Friel, M. Hurn, Tuning tempered transitions, Statistics and Computing 22 (1) (2012) 65-78.

[25] G. Desjardins, A. Courville, Y. Bengio, Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs, in: NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

Appendix A. Bounding the bias of an MCMC estimate in terms of the convergence rate

The convergence rate of a Markov chain can be used to bound the bias of an approximation of the expected value of a function under the model distribution based on samples from the chain. This is demonstrated in the following. Consider a Markov chain on the state space $\Omega$. Let $\pi$ be the stationary distribution of the chain and $P^k(z, \cdot)$ be the distribution obtained by running the Markov chain for $k$ steps starting from a fixed sample $z$. Furthermore, let $t : \Omega \to \mathbb{R}$ be an arbitrary function. Then, a bound on the bias of an estimate that approximates the expected value of $t(x)$ under $\pi$ based on samples from $P^k(z, \cdot)$ is given by

$$\Big|\sum_{x\in\Omega}\pi(x)\,t(x) - \sum_{x\in\Omega}P^k(z, x)\,t(x)\Big| = \Big|\sum_{x\in\Omega}\big(\pi(x) - P^k(z, x)\big)\,t(x)\Big| \le \sum_{x\in\Omega}\big|\pi(x) - P^k(z, x)\big|\,|t(x)|$$
$$\le \max_{x\in\Omega}|t(x)|\sum_{x\in\Omega}\big|\pi(x) - P^k(z, x)\big| = 2\max_{x\in\Omega}|t(x)|\,\|P^k(z,\cdot) - \pi\|,$$

where $\|P^k(z,\cdot) - \pi\|$ is the total variation distance, which can be bounded, for example, using equation (3).
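A quick numerical check of this bound (our sketch, on a made-up three-state chain):

```python
import numpy as np

P = 0.5 * np.eye(3) + 0.5 * (np.ones((3, 3)) - np.eye(3)) / 2.0
pi = np.full(3, 1.0 / 3.0)
t = np.array([1.0, -2.0, 0.5])            # arbitrary function t(x)

for k in (1, 3, 6):
    Pk = np.linalg.matrix_power(P, k)[0]  # k-step distribution from state 0
    bias = abs(pi @ t - Pk @ t)
    tv = 0.5 * np.abs(Pk - pi).sum()
    assert bias <= 2 * np.abs(t).max() * tv + 1e-12
    print(k, bias, 2 * np.abs(t).max() * tv)
```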

Appendix B. Relation between the convergence rates of the PT product chain and the original chain

In the following, we prove that bounding the convergence rate of the PT product chain also bounds the convergence rate of the original chain (i.e., the chain with inverse temperature $\beta_N = 1$). Assume that for the product chain it holds that

$$\sum_x \Big|P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big| < \varepsilon$$

for an arbitrary starting point $y$. Then we can write

$$\sum_{x_{[N]}}\sum_{x_{[-N]}}\Big|P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big| < \varepsilon,$$

which implies

$$\varepsilon > \sum_{x_{[N]}}\Big|\sum_{x_{[-N]}}\Big(P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big)\Big|.$$

Thus, $\varepsilon$ also upper bounds the variation distance for the original chain at inverse temperature $\beta_N = 1$:

$$\sum_{x_{[N]}}\Big|\sum_{x_{[-N]}}\Big(P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big)\Big| = \sum_{x_{[N]}}\Big|P^k_N(y, x_{[N]}) - \pi_N(x_{[N]})\sum_{x_{[-N]}}\prod_{t=0}^{N-1}\pi_t(x_{[t]})\Big|$$
$$= \sum_{x_{[N]}}\big|P^k_N(y, x_{[N]}) - \pi_N(x_{[N]})\big| < \varepsilon,$$

where $P^k_N(y, x_{[N]}) = \sum_{x_{[-N]}}P^k(y, x)$ denotes the marginal distribution of the $N$-th component after $k$ steps of the PT chain.

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

The flip-the-state transition operator for restricted Boltzmann machines

The flip-the-state transition operator for restricted Boltzmann machines Mach Learn 2013) 93:53 69 DOI 10.1007/s10994-013-5390-3 The flip-the-state transition operator for restricted Boltzmann machines Kai Brügge Asja Fischer Christian Igel Received: 14 February 2013 / Accepted:

More information

Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines

Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines Asja Fischer and Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum,

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines KyungHyun Cho, Tapani Raiko, Alexander Ilin Abstract A new interest towards restricted Boltzmann machines (RBMs) has risen due

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines

Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines KyungHyun Cho KYUNGHYUN.CHO@AALTO.FI Tapani Raiko TAPANI.RAIKO@AALTO.FI Alexander Ilin ALEXANDER.ILIN@AALTO.FI Department

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Learning Deep Boltzmann Machines using Adaptive MCMC

Learning Deep Boltzmann Machines using Adaptive MCMC Ruslan Salakhutdinov Brain and Cognitive Sciences and CSAIL, MIT 77 Massachusetts Avenue, Cambridge, MA 02139 rsalakhu@mit.edu Abstract When modeling high-dimensional richly structured data, it is often

More information

Restricted Boltzmann Machines

Restricted Boltzmann Machines Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time

More information

arxiv: v1 [cs.ne] 6 May 2014

arxiv: v1 [cs.ne] 6 May 2014 Training Restricted Boltzmann Machine by Perturbation arxiv:1405.1436v1 [cs.ne] 6 May 2014 Siamak Ravanbakhsh, Russell Greiner Department of Computing Science University of Alberta {mravanba,rgreiner@ualberta.ca}

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

Markov Networks.

Markov Networks. Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

More information

Contrastive Divergence

Contrastive Divergence Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Decomposition Methods and Sampling Circuits in the Cartesian Lattice

Decomposition Methods and Sampling Circuits in the Cartesian Lattice Decomposition Methods and Sampling Circuits in the Cartesian Lattice Dana Randall College of Computing and School of Mathematics Georgia Institute of Technology Atlanta, GA 30332-0280 randall@math.gatech.edu

More information

Mixture Models and Representational Power of RBM s, DBN s and DBM s

Mixture Models and Representational Power of RBM s, DBN s and DBM s Mixture Models and Representational Power of RBM s, DBN s and DBM s Guido Montufar Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany. montufar@mis.mpg.de Abstract

More information

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Density estimation. Computing, and avoiding, partition functions. Iain Murray

Density estimation. Computing, and avoiding, partition functions. Iain Murray Density estimation Computing, and avoiding, partition functions Roadmap: Motivation: density estimation Understanding annealing/tempering NADE Iain Murray School of Informatics, University of Edinburgh

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13 Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent

More information

Deep Generative Models. (Unsupervised Learning)

Deep Generative Models. (Unsupervised Learning) Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns

An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns Daniel Jiwoong Im, Ethan Buchman, and Graham W. Taylor School of Engineering University of Guelph Guelph,

More information

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis Lect4: Exact Sampling Techniques and MCMC Convergence Analysis. Exact sampling. Convergence analysis of MCMC. First-hit time analysis for MCMC--ways to analyze the proposals. Outline of the Module Definitions

More information

Kyle Reing University of Southern California April 18, 2018

Kyle Reing University of Southern California April 18, 2018 Renormalization Group and Information Theory Kyle Reing University of Southern California April 18, 2018 Overview Renormalization Group Overview Information Theoretic Preliminaries Real Space Mutual Information

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines COMP9444 17s2 Boltzmann Machines 1 Outline Content Addressable Memory Hopfield Network Generative Models Boltzmann Machine Restricted Boltzmann

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines Desjardins, G., Courville, A., Bengio, Y., Vincent, P. and Delalleau, O. Technical Report 1345, Dept. IRO, Université de

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

arxiv: v1 [cs.lg] 8 Oct 2015

arxiv: v1 [cs.lg] 8 Oct 2015 Empirical Analysis of Sampling Based Estimators for Evaluating RBMs Vidyadhar Upadhya and P. S. Sastry arxiv:1510.02255v1 [cs.lg] 8 Oct 2015 Indian Institute of Science, Bangalore, India Abstract. The

More information

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that 1 More examples 1.1 Exponential families under conditioning Exponential families also behave nicely under conditioning. Specifically, suppose we write η = η 1, η 2 R k R p k so that dp η dm 0 = e ηt 1

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Machine Learning. Probabilistic KNN.

Machine Learning. Probabilistic KNN. Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007

More information

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann

More information

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness

More information

Learning and Evaluating Boltzmann Machines

Learning and Evaluating Boltzmann Machines Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Ruslan Salakhutdinov 2008. June 26, 2008

More information

ARestricted Boltzmann machine (RBM) [1] is a probabilistic

ARestricted Boltzmann machine (RBM) [1] is a probabilistic 1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Modeling Documents with a Deep Boltzmann Machine

Modeling Documents with a Deep Boltzmann Machine Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava, Ruslan Salakhutdinov & Geoffrey Hinton UAI 2013 Presented by Zhe Gan, Duke University November 14, 2014 1 / 15 Outline Replicated Softmax

More information

Annealing Between Distributions by Averaging Moments

Annealing Between Distributions by Averaging Moments Annealing Between Distributions by Averaging Moments Chris J. Maddison Dept. of Comp. Sci. University of Toronto Roger Grosse CSAIL MIT Ruslan Salakhutdinov University of Toronto Partition Functions We

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Unsupervised Learning

Unsupervised Learning CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Gaussian-Bernoulli Deep Boltzmann Machine

Gaussian-Bernoulli Deep Boltzmann Machine Gaussian-Bernoulli Deep Boltzmann Machine KyungHyun Cho, Tapani Raiko and Alexander Ilin Aalto University School of Science Department of Information and Computer Science Espoo, Finland {kyunghyun.cho,tapani.raiko,alexander.ilin}@aalto.fi

More information

Approximation properties of DBNs with binary hidden units and real-valued visible units

Approximation properties of DBNs with binary hidden units and real-valued visible units Approximation properties of DBNs with binary hidden units and real-valued visible units Oswin Krause Oswin.Krause@diku.dk Department of Computer Science, University of Copenhagen, 100 Copenhagen, Denmark

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang Autoencoder Autoencoder Neural networks trained to attempt to copy its input to its output Contain

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

First Hitting Time Analysis of the Independence Metropolis Sampler

First Hitting Time Analysis of the Independence Metropolis Sampler (To appear in the Journal for Theoretical Probability) First Hitting Time Analysis of the Independence Metropolis Sampler Romeo Maciuca and Song-Chun Zhu In this paper, we study a special case of the Metropolis

More information

Inductive Principles for Restricted Boltzmann Machine Learning

Inductive Principles for Restricted Boltzmann Machine Learning Inductive Principles for Restricted Boltzmann Machine Learning Benjamin Marlin Department of Computer Science University of British Columbia Joint work with Kevin Swersky, Bo Chen and Nando de Freitas

More information

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information