A Bound for the Convergence Rate of Parallel Tempering for Sampling Restricted Boltzmann Machines

Asja Fischer a,b, Christian Igel b

a Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
b Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark

Abstract

Sampling from restricted Boltzmann machines (RBMs) is done by Markov chain Monte Carlo (MCMC) methods. The faster the convergence of the Markov chain, the more efficiently can high-quality samples be obtained. This is also important for robust training of RBMs, which usually relies on sampling. Parallel tempering (PT), an MCMC method that maintains several replicas of the original chain at higher temperatures, has been successfully applied for RBM training. We present the first analysis of the convergence rate of PT for sampling from binary RBMs. The resulting bound on the rate of convergence of the PT Markov chain shows an exponential dependency on the size of one layer and the absolute values of the RBM parameters. It is minimized by a uniform spacing of the inverse temperatures, which is often used in practice. As in the derivation of bounds on the approximation error for contrastive divergence learning, our bound on the mixing time implies an upper bound on the error of the gradient approximation when the method is used for RBM training.

Keywords: restricted Boltzmann machines, parallel tempering, mixing time

1. Introduction

Restricted Boltzmann machines (RBMs) are probabilistic graphical models corresponding to stochastic neural networks [1, 2] (see [3] for a recent review). They are applied in many machine learning tasks, notably they serve as building blocks of deep belief networks [4].

Markov chain Monte Carlo (MCMC) methods are used to sample from RBMs, and chains that quickly converge to their stationary distribution are desirable to efficiently get high-quality samples. Adaptation of the RBM model parameters typically corresponds to gradient-based likelihood maximization given training data. As computing the exact gradient is usually computationally not tractable, sampling-based methods are employed to approximate the likelihood gradient. It has been shown that inaccurate approximations can deteriorate the learning process (e.g., [5]), and for the most popular learning scheme, contrastive divergence learning (CD, [2]), the approximation quality has been analyzed [6, 7]. The quality of the approximation depends, among other things, on how quickly the Markov chain approaches the stationary distribution, that is, on its mixing rate. To improve RBM learning, parallel tempering (PT, [8]) has successfully been used as a sampling method in RBM training [9, 10, 11, 13]. Parallel tempering introduces supplementary Gibbs chains that sample from smoothed replicas of the original distribution with the goal of improving the mixing rate. However, so far there exist no published attempts to analyze the mixing rate of PT applied to RBMs. Based on the work by Woodard et al. [12], we provide the first such analysis. After introducing the basic concepts, section 3 states our main result. Section 4 summarizes general theorems required for our proof in section 5, which is followed by a discussion and our conclusions.

2. Background

In the following, we will give a brief introduction to RBMs and the relation between the mixing rate and the spectral gap of a Markov chain. Afterwards we will describe the parallel tempering algorithm and its application to sampling from RBMs.

2.1. Restricted Boltzmann machines

Restricted Boltzmann machines (RBMs) are probabilistic undirected graphical models (Markov random fields). Their structure is a bipartite graph connecting a set of $m$ visible random variables $V = (V_1, V_2, \dots, V_m)$ modeling observations to $n$ hidden (latent) random variables $H = (H_1, H_2, \dots, H_n)$ capturing dependencies between the visible variables. In binary RBMs the state space of one single variable is given by $\{0, 1\}$ and accordingly $(V, H) \in \{0, 1\}^{m+n}$.

The joint probability distribution of $(V, H)$ is given by the Gibbs distribution

$$p(v, h) = \frac{\exp(-E(v, h))}{Z}$$

with energy

$$E(v, h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} h_i w_{ij} v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i \qquad (1)$$

and real-valued connection weights $w_{ij}$ and bias parameters $b_j$ and $c_i$ for $i \in \{1,\dots,n\}$ and $j \in \{1,\dots,m\}$. The normalization constant $Z$, also called the partition function, is given by $Z = \sum_{v,h} \exp(-E(v, h))$.

Training an RBM means adapting its parameters such that the distribution of $V$ models a distribution underlying some observed data. In practice, this training corresponds to performing stochastic gradient ascent on the log-likelihood of the weight and bias parameters given sample (training) data. The gradient of the log-likelihood given a single training sample $v_{\mathrm{train}}$ is given by

$$\frac{\partial \ln p(v_{\mathrm{train}})}{\partial \theta} = -\sum_h p(h \mid v_{\mathrm{train}}) \frac{\partial E(v_{\mathrm{train}}, h)}{\partial \theta} + \sum_{v,h} p(v, h) \frac{\partial E(v, h)}{\partial \theta}, \qquad (2)$$

where $\theta$ is the vector collecting all parameters. Since the expectation under the model distribution in the second term on the right-hand side cannot be computed efficiently (it is exponential in $\min(n, m)$), it is approximated by MCMC methods in RBM training algorithms. Typically, the expected value under the model distribution is approximated by $\sum_h p(h \mid v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}$ given a sample $v^{(k)}$ obtained by running a Markov chain for $k$ steps. Alternatively, we can consider $\sum_v p(v \mid h^{(k)}) \frac{\partial E(v, h^{(k)})}{\partial \theta}$ given a sample $h^{(k)}$ to save computation time if $m < n$.
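To make equations (1) and (2) concrete, here is a minimal NumPy sketch (our illustration, not code from the paper; all helper names are ours) that evaluates the energy and the tractable first term of the gradient for a small binary RBM.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                               # hidden and visible units
W = rng.normal(scale=0.1, size=(n, m))    # weights w_ij
b = rng.normal(scale=0.1, size=m)         # visible biases b_j
c = rng.normal(scale=0.1, size=n)         # hidden biases c_i

def energy(v, h):
    """E(v, h) from equation (1)."""
    return -h @ W @ v - b @ v - c @ h

def p_h_given_v(v):
    """Conditional probabilities p(H_i = 1 | v); factorizes over i."""
    return 1.0 / (1.0 + np.exp(-(W @ v + c)))

def grad_W_term(v):
    """First term of (2) for theta = w_ij: sum_h p(h|v) h_i v_j."""
    return np.outer(p_h_given_v(v), v)

v_train = rng.integers(0, 2, size=m).astype(float)
print(energy(v_train, np.ones(n)), grad_W_term(v_train))
```

The conditional distribution of the hidden layer factorizes over the units, which is what makes the first term of (2) cheap, while the second term requires MCMC.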

2.2. Mixing rates and the spectral gap

A homogeneous Markov chain on a discrete state space $\Omega$ can be described by a transition matrix $P = (p_{x,y})_{x,y\in\Omega}$, where $p_{x,y}$ is the probability to move from $x$ to $y$ in one step of the Markov chain. We also refer to this probability as $P(x, y)$, and accordingly $P^k(x, y)$ gives the probability to move from $x$ to $y$ in $k$ steps of the chain.

Markov chain Monte Carlo methods make use of the fact that an ergodic Markov chain on $\Omega$ with transition matrix $P$ and equilibrium distribution $\pi$ satisfies $P^k(x, y) \to \pi(y)$ as $k \to \infty$ for all $x, y \in \Omega$. It is important to study the convergence rate of the Markov chain in order to find out how large $k$ has to be to ensure that $P^k(x, y)$ is suitably close to $\pi(y)$. One way to measure the closeness to stationarity is the total variation distance

$$\|P^k(x, \cdot) - \pi\| = \frac12 \sum_{y\in\Omega} |P^k(x, y) - \pi(y)|$$

for an arbitrary starting state $x$. Reversibility of the chain implies that the eigenvalues of $P$ are real-valued, and we sort them by value $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_r \ge -1$. The value $\mathrm{Gap}(P) = 1 - \lambda_2$ is called the spectral gap of $P$, and the eigenvalue with the second largest absolute value, $\mathrm{SLEM} = \max\{\lambda_2, |\lambda_r|\}$, is often referred to as the second largest eigenvalue modulus (SLEM). Many bounds on convergence rates are based on the SLEM. Diaconis and Saloff-Coste [13], for example, prove

$$\|P^k(x, \cdot) - \pi\| \le \frac{1}{2\sqrt{\pi(x)}}\,\mathrm{SLEM}^k. \qquad (3)$$

If $P$ is positive definite, all eigenvalues are non-negative. In this case $\mathrm{SLEM} = \lambda_2$ and we can replace $\mathrm{SLEM}$ in (3) (and similar bounds) by $1 - \mathrm{Gap}(P)$. We can make an arbitrary transition matrix $Q$ positive definite by skipping the move each time with probability $\frac12$, that is, by considering the transition matrix $P = \frac12 I + \frac12 Q$. This makes $P$ in the long run two times slower than $Q$.

A typical application of MCMC methods (e.g., in the training of RBMs) is to approximate the expected value of a given function under the stationary distribution based on samples obtained after running the Markov chain for $k$ steps. A bound on the chain's convergence rate directly leads to a bound on the bias of this approximation (see Appendix A for a proof).
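As a numerical illustration of (3) (our sketch; the three-state chain is made up for the example), one can compute the spectral gap of a lazy chain and check the total variation bound:

```python
import numpy as np

# A reversible transition matrix Q on three states and its lazy version P.
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
P = 0.5 * np.eye(3) + 0.5 * Q           # positive definite, SLEM = lambda_2

pi = np.full(3, 1.0 / 3.0)              # uniform stationary distribution
lam = np.sort(np.linalg.eigvalsh(P))[::-1]
gap = 1.0 - lam[1]                      # Gap(P) = 1 - lambda_2

for k in (1, 5, 10):
    tv = 0.5 * np.abs(np.linalg.matrix_power(P, k)[0] - pi).sum()
    bound = (1.0 - gap) ** k / (2.0 * np.sqrt(pi[0]))
    print(k, tv, bound)                 # tv <= bound must hold
```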

2.3. Parallel tempering

Consider a multi-modal target density $\pi$ on a state space $\Omega$ from which one would like to sample via a Markov chain. The transition steps of the Metropolis-Hastings algorithm move only locally in $\Omega$, so that the resulting Markov chain may move between the modes of $\pi$ only infrequently. The PT algorithm [8] tries to overcome this by introducing supplementary Markov chains, which sample from flattened or smoothed versions of $\pi$. These (and the original density) are referred to as tempered densities $\pi_t$, $t = 0,\dots,N$, fulfilling $\pi_t(z) \propto \pi(z)^{\beta_t}$ for $z \in \Omega$, with inverse temperatures $\beta_t$. The inverse temperatures satisfy $0 \le \beta_0 < \beta_1 < \dots < \beta_N = 1$. Parallel tempering now constructs a Markov chain on the joint state space $\Omega_T = \Omega^{N+1}$ of the tempered distributions. The chain takes values $x = (x_{[0]},\dots,x_{[N]}) \in \Omega_T$ and converges to the stationary distribution

$$\pi_T(x) = \prod_{t=0}^{N} \pi_t(x_{[t]}).$$

The PT transition matrix consists of two components, one for updating the states of the individual tempered chains and one allowing swaps between states of adjacent tempered chains. We will denote the corresponding transition matrices by $T$ and $Q$, respectively. For the considerations in this paper, we add a probability of $\frac12$ of staying in the current state each time $T$ or $Q$ is applied to guarantee positive definiteness and thus positive eigenvalues. We define the PT matrix as $P = QTQ$. A transition matrix constructed like this assures reversibility (unlike $TQ$) and is used in the analysis by Madras and Zheng [14] and Woodard et al. [12]. Here the matrices $T$ and $Q$ are defined as follows:

$T$ chooses $t \in \{0,\dots,N\}$ uniformly and updates the state of chain $t$ based on some transition matrix $T_t$, which is reversible with respect to $\pi_t$. This can, for example, be the Metropolis-Hastings or a Gibbs matrix [15]. Thus, the probability of moving from $x = (x_{[0]},\dots,x_{[N]})$ to $y = (y_{[0]},\dots,y_{[N]})$ under $T$ is given by

$$T(x, y) = \frac{1}{2(N+1)} \sum_{t=0}^{N} T_t(x_{[t]}, y_{[t]})\, I_{\{x_{[-t]}\}}(y_{[-t]}) + \frac12 I_{\{x\}}(y),$$

where $x_{[-t]} = (x_{[0]},\dots,x_{[t-1]}, x_{[t+1]},\dots,x_{[N]}) \in \Omega^N$ collects all components of $x$ except the $t$-th one and, for any set $B$, $I_B(y) = 1$ if $y \in B$ and $I_B(y) = 0$ otherwise. The term $\frac12 I_{\{x\}}(y)$ ensures non-negative eigenvalues as discussed above.

The swapping matrix $Q$ proposes to swap the states $x_{[t]}$ and $x_{[t+1]}$ by sampling $t$ uniformly from $\{0,\dots,N-1\}$. The proposed swap is accepted with the Metropolis probability

$$a(x, t) = \min\left\{1,\; \frac{\pi_t(x_{[t+1]})\,\pi_{t+1}(x_{[t]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\}, \qquad (4)$$

which guarantees detailed balance. So the transition probability from $x = (x_{[0]},\dots,x_{[N]})$ to $y = (x_{[0]},\dots,x_{[t+1]}, x_{[t]},\dots,x_{[N]})$ is given by

$$Q(x, y) = \frac{1}{2N}\, a(x, t),$$

the probability to stay in $x$ is

$$Q(x, x) = 1 - \frac{1}{2N} \sum_{t=0}^{N-1} a(x, t),$$

and the transition probability between all other states is 0. By construction both $T$ and $Q$ are reversible with respect to $\pi_T$ and strictly positive definite due to their holding probability. It follows from the definition of $P$ that both properties also hold for $P$. Thus, $P$ has only positive real-valued eigenvalues and the convergence of the PT chain to the stationary distribution $\pi_T$ can be bounded in terms of (3) by replacing SLEM by $1 - \delta$, where $\delta$ is a lower bound for $\mathrm{Gap}(P)$.

2.4. Training RBMs with PT

The PT algorithm is often the sampling procedure of choice for RBM learning algorithms to approximate the gradient of the log-likelihood given in equation (2) [10, 11]. In this setting, the tempered distributions are defined as

$$\pi_t(v, h) = \frac{\exp(-E(v, h)\beta_t)}{Z_t} \qquad (5)$$

with $Z_t = \sum_{v,h} \exp(-E(v, h)\beta_t)$, and $\beta_0$ is typically set to 0, which we will also assume in our analysis. The transition matrices $T_t$ for the single tempered chains are set to a randomized version of blockwise Gibbs sampling: we choose to update either $V$ or $H$ with equal probability and then update all neurons in this layer simultaneously. The random choice of the layer leads to reversibility of $T_t$. In this paper, we want to find a lower bound for the spectral gap of PT sampling as defined above applied to RBMs.

In the PT algorithm most often used for RBM training in practice, the transition operator differs from the one analyzed in this paper. It consists of an update step followed by a swapping step. During the update step, one step of blockwise Gibbs sampling is performed in all tempered chains in parallel, where one step of blockwise Gibbs sampling corresponds to sampling a new state $h'$ of $H$ based on the conditional probability $p(h' \mid v)$ given the current state of the visible variables $v$, followed by sampling a new state $v'$ of $V$ based on $p(v' \mid h')$. During the swapping step, the states at all temperatures may be swapped based on the Metropolis probability $a(x, t)$ as defined in (4), with $x = (v, h)$ (see the sketch below).
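The following Python sketch (ours, assuming the tempered distributions (5); an illustration of the practical variant just described, not the paper's code) performs one PT iteration for a small binary RBM: a blockwise Gibbs update in every tempered chain followed by adjacent swaps with the Metropolis probability (4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 4, 5                       # hidden, visible, N+1 tempered chains
W = rng.normal(scale=0.5, size=(n, m))
b = rng.normal(scale=0.5, size=m)
c = rng.normal(scale=0.5, size=n)
betas = np.linspace(0.0, 1.0, N + 1)    # uniform inverse temperature spacing

def energy(v, h):
    return -h @ W @ v - b @ v - c @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, h, beta):
    """One blockwise Gibbs step under the tempered distribution pi_beta."""
    h = (rng.random(n) < sigmoid(beta * (W @ v + c))).astype(float)
    v = (rng.random(m) < sigmoid(beta * (h @ W + b))).astype(float)
    return v, h

# One PT iteration: Gibbs update in every chain, then adjacent swaps.
states = [(rng.integers(0, 2, m).astype(float),
           rng.integers(0, 2, n).astype(float)) for _ in range(N + 1)]
states = [gibbs_step(v, h, beta) for (v, h), beta in zip(states, betas)]
for t in range(N):                      # propose swapping chains t and t+1
    (v0, h0), (v1, h1) = states[t], states[t + 1]
    log_a = (betas[t + 1] - betas[t]) * (energy(v1, h1) - energy(v0, h0))
    if np.log(rng.random()) < min(0.0, log_a):
        states[t], states[t + 1] = states[t + 1], states[t]
```

Note that for the Gibbs distributions (5) the swap acceptance (4) reduces to $\min\{1, \exp((\beta_{t+1}-\beta_t)(E(x_{[t+1]}) - E(x_{[t]})))\}$, which the sketch uses in log space; for simplicity it sweeps the swaps sequentially rather than in even/odd substeps.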

The swapping is often organized in two substeps, where even temperatures are considered in the first and odd temperatures in the second substep. While this approach seems to work in practice, the corresponding transition matrix need not be reversible, which makes it difficult to obtain theoretical results on its mixing behavior; see our discussion in section 6.

3. Main Result

Let $\pi$ denote the Gibbs distribution of an RBM as defined in (1) and for $0 = \beta_0 < \beta_1 < \dots < \beta_N = 1$ let $\pi_t$ be the corresponding tempered Gibbs distribution with inverse temperature $\beta_t$ as given in (5). Then the stationary distribution of PT sampling for RBMs as defined above is given by $\pi_T((x_{[0]},\dots,x_{[N]})) = \prod_{t=0}^N \pi_t(x_{[t]})$, and for the spectral gap of the corresponding transition matrix $P$ it holds:

Theorem 1. For the PT transition matrix $P$ of an RBM with $m$ visible and $n$ hidden variables

$$\mathrm{Gap}(P) \ge \min\Bigg\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m} f_j^2,\;\; \frac{\exp\big(-2c - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{j=1}^{m} f_j^2,$$
$$\qquad\qquad \frac{\exp(-2b)}{2^{m+10}(N+1)^4}\prod_{i=1}^{n} g_i^2,\;\; \frac{\exp\big(-2b - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{i=1}^{n} g_i^2\Bigg\} \qquad (6)$$

with $\Delta = \sum_{i,j}|w_{ij}| + \sum_j |b_j| + \sum_i |c_i|$ being the sum over the absolute values of the parameters of the RBM, $b = \sum_j |b_j|$ and $c = \sum_i |c_i|$ being the sums over the absolute values of the bias parameters of the visible and hidden variables, respectively, and

$$f_j = \frac{\min_h p(v_j = 1 \mid h)}{\max_h p(v_j = 1 \mid h)} \quad\text{and}\quad g_i = \frac{\min_v p(h_i = 1 \mid v)}{\max_v p(h_i = 1 \mid v)}$$

being the fractions of the minimal and maximal activation probabilities of the $j$-th visible and the $i$-th hidden neuron. (The sketch below evaluates these quantities for a toy model.)
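For illustration only (our sketch, not part of the paper), the quantities entering the bound (6) can be computed directly for a toy RBM; $f_j$ and $g_i$ are evaluated with the closed-form expressions derived in section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 2, 3, 4
W = rng.normal(scale=0.2, size=(n, m))
bias_v = rng.normal(scale=0.2, size=m)   # b_j
bias_h = rng.normal(scale=0.2, size=n)   # c_i
betas = np.linspace(0.0, 1.0, N + 1)

delta = np.abs(W).sum() + np.abs(bias_v).sum() + np.abs(bias_h).sum()
b = np.abs(bias_v).sum()
c = np.abs(bias_h).sum()

# f_j and g_i via the closed form of section 5: the extreme conditional
# activations are attained by switching all negative/positive weights on.
f = (1 + np.exp(bias_v + np.minimum(W, 0).sum(0))) / \
    (1 + np.exp(bias_v + np.maximum(W, 0).sum(0)))
g = (1 + np.exp(bias_h + np.minimum(W, 0).sum(1))) / \
    (1 + np.exp(bias_h + np.maximum(W, 0).sum(1)))

gap_term = 2 * np.max(np.diff(betas)) * delta
bound = min(
    np.exp(-2 * c) / 2 ** (n + 10) / (N + 1) ** 4 * np.prod(f ** 2),
    np.exp(-2 * c - gap_term) / 2 ** 8 / (N + 1) ** 4 * np.prod(f ** 2),
    np.exp(-2 * b) / 2 ** (m + 10) / (N + 1) ** 4 * np.prod(g ** 2),
    np.exp(-2 * b - gap_term) / 2 ** 8 / (N + 1) ** 4 * np.prod(g ** 2),
)
print(delta, bound)
```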

The proof is given in section 5. By combining this result with equation (3), we arrive at an upper bound on the rate of convergence of the PT chain.

Corollary 1. Let $P$ be the PT transition matrix for an RBM with $m$ visible and $n$ hidden variables and let $\pi_T$ be the joint distribution over all tempered chains. Then for any starting point of the Markov chain $x = (x_{[0]},\dots,x_{[N]})$, with $x_{[i]} = (v_{[i]}, h_{[i]})$, the distance in variation to $\pi_T$ is bounded by

$$\|P^k(x, \cdot) - \pi_T\| \le \frac{1}{2\sqrt{\pi_T(x)}}\,(1-\varepsilon)^k \qquad (7)$$

with

$$\varepsilon = \min\Bigg\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m} f_j^2,\;\; \frac{\exp\big(-2c - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{j=1}^{m} f_j^2,$$
$$\qquad\qquad \frac{\exp(-2b)}{2^{m+10}(N+1)^4}\prod_{i=1}^{n} g_i^2,\;\; \frac{\exp\big(-2b - 2\max_t(\beta_{t+1}-\beta_t)\Delta\big)}{2^{8}(N+1)^4}\prod_{i=1}^{n} g_i^2\Bigg\}$$

and $\Delta$, $b$, $c$, $f_j$, and $g_i$ as defined above.

Theorem 1 bounds the gap of the PT product chain. As the original chain (the chain with inverse temperature $\beta_N = 1$) never converges slower than this product chain (see Appendix B for a proof), Corollary 1 also leads to an upper bound for the convergence rate of the chain of interest. Bounding the product chain leads to the dependency on the number $N$ of supplementary tempered chains. However, to our knowledge all approaches analyzing PT suffer from this drawback (e.g., [12, 14, 16]).

Now, we can make use of the fact that a bound on the convergence rate implies a bound on the bias of an MCMC estimate and obtain a bound on the bias of the PT-based estimator when used in RBM training. Recall that for training RBMs the log-likelihood gradient given in equation (2) is approximated by replacing the expected value under the model distribution in the second term, $\sum_{v,h} p(v,h)\frac{\partial E(v,h)}{\partial\theta}$, by $\sum_h p(h \mid v^{(k)})\frac{\partial E(v^{(k)},h)}{\partial\theta}$ for a sample $v^{(k)}$ obtained by running the PT chain for $k$ steps. The absolute value of $\sum_h p(h \mid v^{(k)})\frac{\partial E(v^{(k)},h)}{\partial\theta}$ is bounded by 1 and thus, using the result in Appendix A with $\max_x |t(x)| = 1$, we get the following corollary:

9 Corollary. Given an RBM with m visible and n hidden variables and a T chain starting from x =(x [0],...,x [N] ), with x [i] =(v [i], h [i] ). Let T(x (k) ) be the approximation of the log-likelihood derivative w.r.t some RBM parameter obtained by estimating the second term in equation () based on a T-sample x (k) =(x (k) [0],...,x(k) [N]), with x(k) [i] =(v (k) [i], h(k) [i] ), gained by running the chain for k steps. That is, for a training sample v train T(x (k) )= (h v train train, h) + )@E(v(k) (h v (k) [0], h h (8) Then we can bound the absolute value of the bias of T(x (k) ) ln (v train x (k) k (x, x (k) ) T (x (k) ) apple where is defined as in corollary. p T (x) ( )k, (9) Similar considerations lead to an alternative proof of the approximation error of contrastive divergence learning [7] by combining the result in Appendix A with standard bounds on the periodic Gibbs sampler (e.g. [0], page 89). 4. Bounding the Spectral Gap of T This section summarizes definitions and results required for the proof of theorem. We state important results for bounding the spectral gap of general Markov chains and recall the link between the SLEM and Dobrushin s coe cient. 4.. Definitions For any transition matrix reversible with respect to a distribution and any subset A of the state space of,therestriction of to A is defined as: A (x,b)= (x,b)+i B (x) (x,a c ) (0) for x A, B A. InthiswaytransitionsthatwouldleaveA are prohibited and the probabilities to stay in A are increased correspondingly. 9

Given a partition $\mathcal{A} = \{A_j : j = 1,\dots,J\}$ of a finite $\Omega$ such that $\pi(A_j) = \sum_{x\in A_j}\pi(x) > 0$ for all $j$, the projection matrix of $P$ with respect to $\mathcal{A}$ is given by the transition operator $\bar P$ with

$$\bar P(i, j) = \frac{1}{\pi(A_i)} \sum_{x\in A_i}\sum_{y\in A_j} \pi(x)\,P(x, y) \qquad (11)$$

for $i, j \in \{1,\dots,J\}$.

4.2. General results

Given a partition $\mathcal{A}$ of the state space as defined above, the following theorem allows to lower bound the spectral gap of the PT transition matrix in terms of the spectral gap of the projection matrix and the minimum over the spectral gaps of the restrictions of $P$ to the subsets $A_j$.

Theorem 2. Let $P$ be a transition matrix reversible with respect to a distribution $\pi$ on a state space $\Omega$. Let $\{A_j : j = 1,\dots,J\}$ be any partition of $\Omega$ such that $\pi(A_j) > 0$ for all $j$. Define $P|_{A_j}$ as in (10) and $\bar P$ as in (11). If $P$ is nonnegative definite, it holds

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_{j=1,\dots,J}\mathrm{Gap}(P|_{A_j}).$$

As Woodard et al. [12] show, this theorem can be proven based on the results from Caracciolo et al. [17] (which were first published in [18]) and the results from Madras and Zheng [14].

Often, one wishes to bound the spectral gap of one chain based on the spectral gap of another chain, which, for example, is easier to estimate. In this case the following theorem can be applied. Combining the theorem for the comparison of the Dirichlet forms of two Markov chains given by Diaconis and Saloff-Coste [19] with Rayleigh's theorem (e.g., [20]) we get:

Theorem 3. Let $P$ and $Q$ be transition matrices on a finite state space $\Omega$, reversible with respect to densities $\pi_P$ and $\pi_Q$, respectively. Denote by $E_P = \{(x,y) : \pi_P(x)P(x,y) > 0\}$ and $E_Q = \{(x,y) : \pi_Q(x)Q(x,y) > 0\}$ the edge sets of the corresponding transition graphs. For each pair $x \ne y$ such that $(x,y) \in E_Q$ fix a path $\gamma_{x,y} = (x = x_0, x_1,\dots,x_k = y)$ of length $|\gamma_{x,y}| = k$ such that $(x_i, x_{i+1}) \in E_P$ for $i \in \{0,\dots,k-1\}$, and define

$$c = \max_{(z,w)\in E_P} \frac{1}{\pi_P(z)\,P(z,w)} \sum_{\gamma_{x,y}\ni(z,w)} |\gamma_{x,y}|\,\pi_Q(x)\,Q(x,y).$$

Then it holds $\mathrm{Gap}(Q) \le c\,\mathrm{Gap}(P)$.

If both chains are reversible with respect to the same distribution, the following theorem proven in [21] can be applied:

Theorem 4. Let $P$ and $Q$ be transition matrices on a state space $\Omega$ reversible with respect to $\pi$. If $Q(x, A\setminus\{x\}) \le P(x, A\setminus\{x\})$ for every $x \in \Omega$ and every $A \subseteq \Omega$, then $\mathrm{Gap}(Q) \le \mathrm{Gap}(P)$.

Recall that the PT transition matrix $T$ combines $N+1$ transition operators $T_0,\dots,T_N$ by taking values $x = (x_{[0]},\dots,x_{[N]})$ in the joint state space $\Omega_T$ and performing a transition by randomly picking $t \in \{0,\dots,N\}$ and sampling a new value $x_{[t]}$ for the $t$-th chain from $T_t$. Such a Markov chain is called a product chain. The following theorem deals with product chains and gives the dependency between the spectral gap of the product chain and the spectral gaps of all single chains it subsumes ([12]):

Theorem 5. For any natural number $N$ and $t = 0,\dots,N$, let $T_t$ be a $\pi_t$-reversible transition matrix on a state space $\Omega_t$. Let $T$ be the transition matrix on $\Omega = \prod_t \Omega_t$ given by

$$T(x, y) = \sum_{t=0}^{N} b_t\,T_t(x_{[t]}, y_{[t]})\,I_{\{x_{[-t]}\}}(y_{[-t]}), \qquad x, y \in \Omega,$$

for some set of $b_t > 0$ such that $\sum_t b_t = 1$, and where $x_{[t]}$ denotes the $t$-th component of $x$ (corresponding to the state of the $t$-th chain) and $x_{[-t]}$ all components except the $t$-th one. A Markov chain with transition matrix $T$ is called a product chain. It is reversible with respect to $\pi(x) = \prod_t \pi_t(x_{[t]})$ and

$$\mathrm{Gap}(T) = \min_{t=0,\dots,N} b_t\,\mathrm{Gap}(T_t).$$

The next theorem is a well-known result that gives an upper bound on the SLEM (e.g., see [20]).

Theorem 6. The second largest eigenvalue modulus SLEM of a transition matrix $P = (p_{x,y})_{x,y\in\Omega}$ can be bounded from above by Dobrushin's coefficient:

$$\mathrm{SLEM} \le D(P) = \frac12 \max_{x,y} \sum_z |p_{x,z} - p_{y,z}| = 1 - \min_{x,y} \sum_z \min\{p_{x,z}, p_{y,z}\}.$$

5. Proof of the main result

For proving the main result we make use of an approach inspired by Woodard et al. [12]. The basic idea behind this approach is to partition the state space of the PT chain into subsets, such that the Markov chain mixes well inside the single subsets at all temperatures and between the subsets at the highest temperature. To begin with, a suitable partitioning of the RBM state space is needed. For a binary RBM with $m$ visible and $n$ hidden neurons the state space is $\Omega = \{0,1\}^{n+m}$. Let $\mathcal{A} = \{A_j : j = 1,\dots,2^n\}$ be the partition of $\Omega$ where each subset $A_j$ contains all states having the same state of the hidden neurons, that is, $A_j = \{(v, h) \mid h = h_j\}$ if $h_j$ denotes the $j$-th state of the hidden random variables. Thus, we get $2^n$ subsets $A_1,\dots,A_{2^n}$ and each $A_j$ contains $2^m$ elements (the sketch below enumerates this partition for a tiny RBM). Using this partition, we can now prove Theorem 1 based on the results given in the previous section.

Our proof can be divided into five steps, where steps 1 and 2 are taken from Woodard et al. [12]. The basic thoughts behind the steps of the proof can be outlined as follows. The gap of the PT transition matrix $P$ depends on (a) how well the single tempered chains mix inside the single subsets $A_1,\dots,A_{2^n}$, (b) how well the chain at the highest temperature (i.e., the chain with inverse temperature $\beta_0$) mixes between the subsets, and (c) the mixing properties between the chains at the different temperatures. In step 1, $\mathrm{Gap}(P)$ is bounded in terms of the gaps of two transition matrices, one depending on (a) and the other depending on (b) and (c). The first one is further bounded in steps 2 and 3 and the second one in step 4. Step 5 puts everything together.
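As a concrete illustration (our sketch, not part of the proof), the partition $\mathcal{A}$ for a tiny binary RBM can be enumerated directly; each block $A_j$ collects the $2^m$ states that share the hidden configuration $h_j$:

```python
from itertools import product

n, m = 2, 3                     # hidden and visible units
partition = {}
for j, h in enumerate(product((0, 1), repeat=n)):
    # A_j = all (v, h_j): the 2^m states sharing hidden configuration h_j
    partition[j] = [(v, h) for v in product((0, 1), repeat=m)]

assert len(partition) == 2 ** n
assert all(len(block) == 2 ** m for block in partition.values())
print(partition[0][:2])         # first two states of A_0
```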

5.1. Step 1: Bounding Gap(P) in terms of Gap(P̄) and Gap(P|_{Ω_σ})

Consider the state $x = (x_{[0]},\dots,x_{[N]}) \in \Omega^{N+1}$ of the PT chain. Let the signature be the vector $s(x) = (\sigma_0,\dots,\sigma_N)$ with $\sigma_t = j$ if $x_{[t]} \in A_j$ for $t = 0,\dots,N$. Since the partition of $\Omega$ consists of $2^n$ subsets $A_j$ and we have $N+1$ temperatures, a signature lives in $\Sigma = \{1,\dots,2^n\}^{N+1}$. For a fixed $\sigma \in \Sigma$ let us now define $\Omega_\sigma = \{x \in \Omega^{N+1} : s(x) = \sigma\}$. Then all possible $\sigma$ induce a partition $\{\Omega_\sigma\}$ of the PT state space $\Omega_T = \Omega^{N+1}$. Let $P|_{\Omega_\sigma}$ now be the restriction of $P$ to $\Omega_\sigma$ as defined in equation (10), and let $\bar P$ denote the projection matrix of $P$ with respect to $\{\Omega_\sigma\}$ as defined in (11). Now we can apply Theorem 2 and get

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_{\sigma\in\Sigma}\mathrm{Gap}(P|_{\Omega_\sigma}). \qquad (12)$$

5.2. Step 2: Bounding Gap(P|_{Ω_σ}) in terms of min_{t,j} Gap(T_t|_{A_j})

Woodard et al. now proceed by bounding $\mathrm{Gap}(\bar P)$ and $\mathrm{Gap}(P|_{\Omega_\sigma})$ separately. For bounding $\mathrm{Gap}(P|_{\Omega_\sigma})$ they note that, since $P = QTQ$ and $Q$ has a holding probability of $\frac12$, we have $P(x, y) \ge \frac14 T(x, y)$ for all $x, y$. With Theorem 4 (applied to the lazy chain $\frac14 T|_{\Omega_\sigma} + \frac34 I$) it follows that $\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac14\mathrm{Gap}(T|_{\Omega_\sigma})$. As $T$ is a product chain, Theorem 5 gives

$$\mathrm{Gap}(T|_{\Omega_\sigma}) = \frac{1}{2(N+1)}\min_t \mathrm{Gap}(T_t|_{A_{\sigma_t}}) \ge \frac{1}{2(N+1)}\min_{t,j}\mathrm{Gap}(T_t|_{A_j}).$$

Thus, it follows

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{8(N+1)}\min_{t,j}\mathrm{Gap}(T_t|_{A_j}). \qquad (13)$$

5.3. Step 3: Bounding min_{t,j} Gap(T_t|_{A_j})

We will now derive a lower bound for $\min_{t,j}\mathrm{Gap}(T_t|_{A_j})$. The transition matrix $T_t$ analyzed here for (tempered) RBMs corresponds to randomized blockwise Gibbs sampling as described above. The transition probability from $(v, h)$ to $(v', h)$ is given by $T_t((v,h),(v',h)) = \frac12\pi_t(v' \mid h)$, and the probability to change the state of the hidden variables is accordingly $T_t((v,h),(v,h')) = \frac12\pi_t(h' \mid v)$. Based on (10), for the restriction of $T_t$ to $A_j$ it holds:

$$T_t|_{A_j}((v,h_j),(v',h_j)) = T_t((v,h_j),(v',h_j)) + I_{\{(v,h_j)\}}((v',h_j))\,T_t((v,h_j), A_j^c)$$

for $(v,h_j), (v',h_j) \in A_j$. Thus, for $v \ne v'$

$$T_t|_{A_j}((v,h_j),(v',h_j)) = \tfrac12\pi_t(v' \mid h_j),$$

and the probability to stay in the same state is

$$T_t|_{A_j}((v,h_j),(v,h_j)) = \tfrac12\pi_t(v \mid h_j) + \tfrac12\sum_{h\ne h_j}\pi_t(h \mid v).$$

Let us denote the SLEM of $T_t|_{A_j}$ by $\mathrm{SLEM}^{T_t|_{A_j}}$ and the second largest eigenvalue by $\lambda_2^{T_t|_{A_j}}$. Based on Theorem 6 it holds

$$\mathrm{Gap}(T_t|_{A_j}) = 1 - \lambda_2^{T_t|_{A_j}} \ge 1 - \mathrm{SLEM}^{T_t|_{A_j}} \ge 1 - D(T_t|_{A_j}). \qquad (14)$$

To upper bound the Dobrushin coefficient of $T_t|_{A_j}$, first note that for all $v, v'$:

$$\sum_{\hat v}\min\{T_t|_{A_j}((v,h_j),(\hat v,h_j)),\, T_t|_{A_j}((v',h_j),(\hat v,h_j))\} \ge \sum_{\hat v}\tfrac12\pi_t(\hat v \mid h_j) = \tfrac12,$$

since for $\hat v \notin \{v, v'\}$ both arguments of the minimum equal $\frac12\pi_t(\hat v \mid h_j)$, and for $\hat v \in \{v, v'\}$ the minimum is at least $\frac12\pi_t(\hat v \mid h_j)$ because the holding probability only adds to it. Now it is easy to see that $D(T_t|_{A_j}) \le 1 - \frac12 = \frac12$, and insertion into (14) gives $\mathrm{Gap}(T_t|_{A_j}) \ge 1 - D(T_t|_{A_j}) \ge \frac12$. Thus, we finally get by insertion into equation (13)

$$\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{1}{16(N+1)}. \qquad (15)$$

5.4. Step 4: Bounding Gap(P̄)

For bounding $\mathrm{Gap}(\bar P)$ first note that $\bar P$ is reversible with respect to the probability mass function

$$\bar\pi(\sigma) = \pi_T(\Omega_\sigma) = \prod_{t=0}^{N}\pi_t(A_{\sigma_t}), \qquad \forall\sigma\in\Sigma,$$

and that according to definition (11), for any $\sigma, \eta \in \Sigma$, the probability of moving from $\sigma$ to $\eta$ under $\bar P$ is given by

$$\bar P(\sigma, \eta) = \frac{1}{\bar\pi(\sigma)}\sum_{x\in\Omega_\sigma}\sum_{y\in\Omega_\eta}\pi_T(x)\,P(x, y). \qquad (16)$$

We proceed by bounding $\mathrm{Gap}(\bar P)$ based on Theorem 3 by comparing $\bar P$ to another $\bar\pi$-reversible transition matrix $\bar T$. The transition matrix $\bar T$ chooses $t$ uniformly from $\{0,\dots,N\}$ and then draws $\sigma_t$ with the probability $\pi_t(A_{\sigma_t})$. To ease the notation let us now denote $\sigma_{[i,j]} = (\sigma_0,\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$. Now we can define the transition matrix $\bar T$ by

$$\bar T(\sigma, \sigma_{[i,j]}) = \frac{1}{N+1}\,\pi_i(A_j)$$

and $\bar T(\sigma, \sigma') = 0$ if $\forall i,j: \sigma' \ne \sigma_{[i,j]}$. For the application of Theorem 3, for each edge $(\sigma, \sigma_{[i,j]})$ in the transition graph of $\bar T$ let us define a path $\gamma_{\sigma,\sigma_{[i,j]}}$ in the transition graph of $\bar P$ as follows:

Stage 1: change $\sigma_0$ to $j$;
Stage 2: swap $j$ up to level $i$;
Stage 3: swap the new $\sigma_{i-1}$ (formerly $\sigma_i$) down to level 0;
Stage 4: change the value at level 0 to $\sigma_0$ (from former $\sigma_i$).

We derive an upper bound of the constant $c$ of Theorem 3 by splitting it into three terms and bounding each term separately. Here $c$ is the maximum with respect to $\nu$ and $\nu'$ (with $\bar\pi(\nu)\bar P(\nu,\nu') > 0$) of

$$\underbrace{\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}|}_{\text{Term I}}\;\cdot\;\underbrace{\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}}_{\text{Term II}}\;\cdot\;\underbrace{\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')}}_{\text{Term III}}. \qquad (17)$$

The product of the three upper bounds on the three terms I, II, and III will be used as an upper bound on $c$.

Bounding Term I. First, we want to find an upper bound for

$$\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}|$$

for the above-defined paths and for any edge $(\nu,\nu')$ in the graph of $\bar P$. Let us start with finding an upper bound for the length $|\gamma_{\sigma,\sigma_{[i,j]}}|$ of a path. Since $i \in \{0,\dots,N\}$, stages 2 and 3 can each consist of at most $N$ swapping moves. Stages 1 and 4 each contain only one transition. Thus, $|\gamma_{\sigma,\sigma_{[i,j]}}| \le 2N + 2 = 2(N+1)$.

We will now upper bound the number of paths that go through any edge $(\nu,\nu')$. If the edge corresponds to stage 1 of the path, $\nu = \sigma$ and $\nu' = \sigma_{[0,j]}$, and since $i \in \{0,\dots,N\}$ there are not more than $N+1$ paths containing this edge. A similar argumentation holds for edges in stage 4. If the edge is in stage 2 of the path, $\nu = (\nu_0,\dots,\nu_{l-1}, j, \nu_{l+1},\dots,\nu_N)$ for some $l \in \{0,\dots,i-1\}$ and $\nu' = (\nu_0,\dots,\nu_{l-1}, \nu_{l+1}, j, \nu_{l+2},\dots,\nu_N)$. Here, $\sigma_0$ is unknown and has $2^n$ possible values. With $i \in \{0,\dots,N\}$ there are not more than $2^n(N+1)$ paths containing an edge in stage 2 of the path. Similarly, there are no more than $2^n(N+1)$ paths containing a certain edge in stage 3 of the path. Since the same edge can either be found in stage 1 or 4 of the path or in stage 2 or 3, each edge can lie in two stages, each corresponding to at most $2^n(N+1)$ paths. Therefore, the total number of paths containing a single edge is no more than $2\max\{N+1,\, 2^n(N+1)\} = 2^{n+1}(N+1)$, and thus

$$\sum_{\gamma_{\sigma,\sigma_{[i,j]}}\ni(\nu,\nu')}|\gamma_{\sigma,\sigma_{[i,j]}}| \le 2^{n+2}(N+1)^2. \qquad (18)$$

Bounding Term II. For bounding $\bar\pi(\sigma)/\bar\pi(\nu)$ first note that any state $\nu$ in stage 1 or 2 of the path from $\sigma$ to $\sigma_{[i,j]}$ as given above is of the form $\nu = (\sigma_1,\dots,\sigma_l, j, \sigma_{l+1},\dots,\sigma_N)$ for some $l \in \{0,\dots,i\}$. Therefore,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \left[\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right]\frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)}. \qquad (19)$$

For the Boltzmann distribution of RBMs we have

$$\frac{\pi_0(A_{\sigma_0})}{\pi_l(A_j)} = \frac{Z_l}{Z_0}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)}$$

and for the first factor on the right side of equation (19)

$$\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})} = \frac{Z_0}{Z_l}\prod_{k=1}^{l}\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})}.$$

Now consider that

$$\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})} \le \frac{\sum_v \exp(-\min_h E(v,h)\beta_k)}{\sum_v \exp(-\min_h E(v,h)\beta_{k-1})}, \qquad (20)$$

because we can write

$$\frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k)}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1})} = \frac{\sum_v \exp(-E(v,h_{\sigma_k})\beta_k + \max_{v,h}E(v,h))}{\sum_v \exp(-E(v,h_{\sigma_k})\beta_{k-1} + \max_{v,h}E(v,h))},$$

which makes all arguments of the exponential function in the terms of denominator and numerator nonnegative. The function

$$\frac{R_k + \exp(-x\beta_k + y)}{R_{k-1} + \exp(-x\beta_{k-1} + y)}$$

is monotonically decreasing in $x$ for $x \le y$ and $\beta_k \ge \beta_{k-1} \ge 0$, and for each value of $v$ we can write $x = E(v, h_{\sigma_k})$ and $y = \max_{v,h}E(v,h)$ and fix $R_k$ and $R_{k-1}$ to the remaining terms in numerator and denominator, respectively. Thus, the expression gets maximal if we replace $x$ by $\min_h E(v,h)$. This can be done for all values of $v$. So we can write:

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{Z_0}{Z_l}\left[\prod_{k=1}^{l}\frac{\sum_v \exp(-\min_h E(v,h)\beta_k)}{\sum_v \exp(-\min_h E(v,h)\beta_{k-1})}\right]\frac{Z_l}{Z_0}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)}$$
$$= \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_0)}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_j)\beta_l)} \stackrel{\beta_0=0}{=} \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-E(v,h_j)\beta_l)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Any state $\nu$ in stage 3 of the path is of the form $\nu = (\sigma_1,\dots,\sigma_l, \sigma_i, \sigma_{l+1},\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$ for some $l \in \{0,\dots,i-1\}$. Thus,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \left[\prod_{k=1}^{l}\frac{\pi_k(A_{\sigma_k})}{\pi_{k-1}(A_{\sigma_k})}\right]\frac{\pi_0(A_{\sigma_0})\,\pi_i(A_{\sigma_i})}{\pi_l(A_{\sigma_i})\,\pi_i(A_j)}.$$

In an analogous way as above we get

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_0)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_0)}{\sum_v \exp(-\min_h E(v,h)\beta_i)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)}$$
$$= \frac{\sum_v \exp(-\min_h E(v,h)\beta_l)}{\sum_v \exp(-\min_h E(v,h)\beta_i)}\cdot\frac{\sum_v \exp(-\min_h E(v,h)\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Any state $\nu$ in stage 4 is given by $\nu = (\sigma_i, \sigma_1,\dots,\sigma_{i-1}, j, \sigma_{i+1},\dots,\sigma_N)$. Therefore,

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} = \frac{\pi_0(A_{\sigma_0})\,\pi_i(A_{\sigma_i})}{\pi_0(A_{\sigma_i})\,\pi_i(A_j)} = \frac{\sum_v \exp(-E(v,h_{\sigma_0})\beta_0)}{\sum_v \exp(-E(v,h_{\sigma_i})\beta_0)}\cdot\frac{\sum_v \exp(-E(v,h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)}$$
$$\stackrel{\beta_0=0}{=} \frac{\sum_v \exp(-E(v,h_{\sigma_i})\beta_i)}{\sum_v \exp(-E(v,h_j)\beta_i)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}.$$

Thus, by now we have shown that for all states $\nu$ in the path we have

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)} \le \frac{\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}. \qquad (21)$$

Bounding Term III. Now we will bound the remaining term $\bar T(\sigma,\sigma_{[i,j]})/\bar P(\nu,\nu')$. We have

$$\bar T(\sigma,\sigma_{[i,j]}) = \frac{1}{N+1}\,\pi_i(A_j) \le \frac{1}{N+1}\cdot\frac{\sum_v \exp(-\min_h E(v,h))}{2^n\sum_v \exp(-\max_h E(v,h))}. \qquad (22)$$

For bounding $\bar P(\nu,\nu')$, we consider two cases. Any edge $(\nu,\nu')$ on the path $\gamma_{\sigma,\sigma_{[i,j]}}$ can be one of the following two types:

Case 1: It is an edge where we get $\nu'$ from $\nu = (\nu_0,\dots,\nu_N)$ by replacing $\nu_0$ by any new state $\nu_0'$ and, thus, $\nu' = \nu_{[0,\nu_0']} = (\nu_0', \nu_1,\dots,\nu_N)$.

Case 2: It is an edge which we obtain by swapping two elements $\nu_t$ and $\nu_{t+1}$ and, thus, $\nu' = (\nu_0,\dots,\nu_{t+1}, \nu_t,\dots,\nu_N)$.

Let us analyse these two cases separately:

Case 1: Based on (16) and

$$P(x, y) \ge Q(x, x)\,T(x, y)\,Q(x, x) \ge \tfrac14\,T(x, y),$$

we get

$$\bar P(\nu,\nu') \ge \frac{1}{4\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\sum_{y\in\Omega_{\nu'}}T(x, y).$$

Because we consider the first case, we can restrict the inner sum to $y$ that differ from $x$ only in the state of the lowest chain, since $T(x, y) = 0$ otherwise. That is, for $x = (x_{[0]}, x_{[1]},\dots,x_{[N]})$ we restrict the inner sum to all $y = (y_{[0]}, x_{[1]},\dots,x_{[N]})$ with $y_{[0]} \in A_{\nu_0'}$. In this case we have

$$T(x, y) = \frac{1}{2(N+1)}\,T_0(x_{[0]}, y_{[0]}).$$

Furthermore, for $x_{[0]} = (v_{[0]}, h_{[0]})$ there is only one $y_{[0]} \in A_{\nu_0'}$ that can be reached by one transition of $T_0$, namely $y_{[0]} = (v_{[0]}, h_{\nu_0'})$, with probability $T_0(x_{[0]}, y_{[0]}) = \frac{1}{2\cdot 2^n}$ for $\beta_0 = 0$ (probability $\frac12$ to sample a new state of the $n$ hidden variables, which are uniformly distributed). And thus

$$\bar P(\nu,\nu') \ge \frac{1}{4\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\,\frac{1}{2(N+1)}\cdot\frac{1}{2^{n+1}} = \frac{1}{16\cdot 2^n(N+1)},$$

where we use $\sum_{x\in\Omega_\nu}\pi_T(x) = \bar\pi(\nu)$. Thus, we get

$$\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le 16\cdot 2^n(N+1)\,\bar T(\sigma,\sigma_{[i,j]}).$$

Using (22) this is bounded from above by

$$\frac{2^4\sum_v \exp(-\min_h E(v,h))}{\sum_v \exp(-\max_h E(v,h))}. \qquad (23)$$

Case 2: For bounding $\bar P(\nu,\nu')$ in the second case, we can make use of $P(x, y) \ge \frac{1}{4N}a(x,t)$, because $P = QTQ$ and the swap can either occur under the first or the second $Q$ (each proposing it with probability $\frac{1}{2N}$) while the probability of holding under $T$ and the other $Q$ is $\frac14$. Thus, we get

$$\bar P(\nu,\nu') \ge \frac{1}{4N\,\bar\pi(\nu)}\sum_{x\in\Omega_\nu}\pi_T(x)\,a(x,t)$$
$$= \frac{1}{4N\,\pi_t(A_{\nu_t})\,\pi_{t+1}(A_{\nu_{t+1}})}\sum_{x_{[t]}\in A_{\nu_t}}\sum_{x_{[t+1]}\in A_{\nu_{t+1}}}\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})\min\left\{1,\frac{\pi_{t+1}(x_{[t]})\,\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\},$$

where in the last step we made use of the facts that $\nu_{[i]} = \nu'_{[i]}$ for $i \in \{0,\dots,N\}\setminus\{t, t+1\}$ and that the probabilities to propose and to accept a certain swap according to $Q$ are $\frac{1}{2N}$ and $\min\big\{1, \frac{\pi_{t+1}(x_{[t]})\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\pi_{t+1}(x_{[t+1]})}\big\}$, respectively. We further have

$$\sum_{x_{[t]}\in A_{\nu_t}}\sum_{x_{[t+1]}\in A_{\nu_{t+1}}}\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})\min\left\{1,\frac{\pi_{t+1}(x_{[t]})\,\pi_t(x_{[t+1]})}{\pi_t(x_{[t]})\,\pi_{t+1}(x_{[t+1]})}\right\}$$
$$= \frac{1}{Z_t Z_{t+1}}\sum_v\sum_{\hat v}\exp(-E(v,h_{\nu_t})\beta_t)\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1})\min\big\{1,\exp\big((E(v,h_{\nu_t}) - E(\hat v,h_{\nu_{t+1}}))(\beta_{t+1}-\beta_t)\big)\big\}$$
$$\ge \frac{\exp\big(-(\max_{v,h}E(v,h)-\min_{v,h}E(v,h))(\beta_{t+1}-\beta_t)\big)}{Z_t Z_{t+1}}\sum_v \exp(-E(v,h_{\nu_t})\beta_t)\sum_{\hat v}\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1}) > 0,$$

and

$$\pi_t(A_{\nu_t})\,\pi_{t+1}(A_{\nu_{t+1}}) = \frac{1}{Z_t Z_{t+1}}\Big(\sum_v \exp(-E(v,h_{\nu_t})\beta_t)\Big)\Big(\sum_{\hat v}\exp(-E(\hat v,h_{\nu_{t+1}})\beta_{t+1})\Big),$$

and thus a bound of $\bar P(\nu,\nu')$ is given by

$$\bar P(\nu,\nu') \ge \frac{1}{4N}\exp\big(-\max_t(\beta_{t+1}-\beta_t)\,(\max_{v,h}E(v,h)-\min_{v,h}E(v,h))\big).$$

Using

$$\max_{v,h}E(v,h) - \min_{v,h}E(v,h) \le 2\Delta,$$

with $\Delta = \sum_{i,j}|w_{ij}| + \sum_j|b_j| + \sum_i|c_i|$ summing the absolute values of the parameters of the RBM, we get

$$\bar P(\nu,\nu') \ge \frac{\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)}{4N}. \qquad (24)$$

From (24) and the upper bound for $\bar T(\sigma,\sigma_{[i,j]})$ given in (22), it follows in the current case that

$$\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\sum_v \exp(-\min_h E(v,h))}{2^n\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\sum_v \exp(-\max_h E(v,h))}. \qquad (25)$$

Combining Terms I, II, III. By joining the results (23) and (25) for Term III with the upper bound for Term II given in (21), we get in Case 1

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^4\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2} \qquad (26)$$

and in Case 2

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{2^n\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\big(\sum_v \exp(-\max_h E(v,h))\big)^2}. \qquad (27)$$

If we combine the results (26) and (27) for the two cases, we get

$$\frac{\bar\pi(\sigma)}{\bar\pi(\nu)}\cdot\frac{\bar T(\sigma,\sigma_{[i,j]})}{\bar P(\nu,\nu')} \le \frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^2,\; \frac{\exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^n}\right\}. \qquad (28)$$

Putting this and (18) together, we arrive at an upper bound for the constant $c$ from Theorem 3:

$$c \le 2^{n+2}(N+1)^2\cdot\frac{2^2\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^2,\; \frac{\exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^n}\right\}$$
$$= 16(N+1)^2\,\frac{\big(\sum_v \exp(-\min_h E(v,h))\big)^2}{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}\max\left\{2^{n+2},\; \exp(2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}.$$

Thus, applying Theorem 3 and Theorem 5, from which it follows that $\mathrm{Gap}(\bar T) = \frac{1}{N+1}$ because we have $b_0 = b_1 = \dots = b_N = 1/(N+1)$ and all component chains of the product chain $\bar T$ have a spectral gap of 1, we get:

$$\mathrm{Gap}(\bar P) \ge \frac{1}{c}\,\mathrm{Gap}(\bar T) \ge \frac{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}{16(N+1)^3\big(\sum_v \exp(-\min_h E(v,h))\big)^2}\min\left\{2^{-(n+2)},\; \exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}. \qquad (29)$$

5.5. Step 5: Putting it all together

Using (29) and (15) in (12) leads to

$$\mathrm{Gap}(P) \ge \mathrm{Gap}(\bar P)\,\min_\sigma\mathrm{Gap}(P|_{\Omega_\sigma}) \ge \frac{\big(\sum_v \exp(-\max_h E(v,h))\big)^2}{2^8(N+1)^4\big(\sum_v \exp(-\min_h E(v,h))\big)^2}\min\left\{2^{-(n+2)},\;\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\},$$

which we further bound by making use of the factorisation of the marginal distribution of the hidden RBM variables. As can, for example, be seen from equation (10) in [3], the marginal distribution of the hidden units can be written as

$$p(h) = \frac{1}{Z}\sum_v \exp(-E(v,h)) = \frac{1}{Z}\prod_{i=1}^{n}\exp(c_i h_i)\prod_{j=1}^{m}\Big(1+\exp\Big(b_j+\sum_{i=1}^{n}w_{ij}h_i\Big)\Big).$$

This leads to

$$\frac{\sum_v \exp(-\max_h E(v,h))}{\sum_v \exp(-\min_h E(v,h))} \ge \frac{\prod_{i=1}^{n}\exp(\min_{h_i}c_i h_i)\prod_{j=1}^{m}\big(1+\exp(b_j+\sum_{i=1}^{n}\min_{h_i}w_{ij}h_i)\big)}{\prod_{i=1}^{n}\exp(\max_{h_i}c_i h_i)\prod_{j=1}^{m}\big(1+\exp(b_j+\sum_{i=1}^{n}\max_{h_i}w_{ij}h_i)\big)}$$
$$= \exp\Big(-\sum_i |c_i|\Big)\prod_{j=1}^{m}\frac{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)}{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)} = \exp(-c)\prod_{j=1}^{m}f_j,$$

with $c = \sum_i |c_i|$ and

$$f_j = \frac{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)}{1+\exp\big(b_j+\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)} = \frac{\mathrm{sig}\big(-b_j-\sum_{i=1}^{n}w_{ij}I_{[0,\infty)}(w_{ij})\big)}{\mathrm{sig}\big(-b_j-\sum_{i=1}^{n}w_{ij}I_{(-\infty,0]}(w_{ij})\big)} = \frac{\min_h p(v_j = 1\mid h)}{\max_h p(v_j = 1\mid h)},$$

where $\mathrm{sig}(x) = \frac{1}{1+\exp(-x)}$. Therefore, $f_j$ is just the fraction of the minimal and maximal activation of the $j$-th visible unit of the RBM.
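Before continuing, the middle equality above, relating the quotient of the two logistic terms to a quotient of sigmoids, can be checked numerically (our sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 4
W = rng.normal(size=(n, m))
b = rng.normal(size=m)

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

s_neg = np.minimum(W, 0).sum(0)   # sum of negative weights into each v_j
s_pos = np.maximum(W, 0).sum(0)   # sum of positive weights into each v_j

lhs = (1 + np.exp(b + s_neg)) / (1 + np.exp(b + s_pos))
rhs = sig(-b - s_pos) / sig(-b - s_neg)
assert np.allclose(lhs, rhs)      # (1+e^x)/(1+e^y) = sig(-y)/sig(-x)
```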

Thus, we get

$$\mathrm{Gap}(P) \ge \frac{\exp(-2c)}{2^8(N+1)^4}\prod_{j=1}^{m}f_j^2\,\min\left\{2^{-(n+2)},\;\exp(-2\max_t(\beta_{t+1}-\beta_t)\Delta)\right\}$$
$$= \min\left\{\frac{\exp(-2c)}{2^{n+10}(N+1)^4}\prod_{j=1}^{m}f_j^2,\;\; \frac{\exp(-2c-2\max_t(\beta_{t+1}-\beta_t)\Delta)}{2^8(N+1)^4}\prod_{j=1}^{m}f_j^2\right\}.$$

It is possible to repeat the proof while changing the roles of hidden and visible variables, which finally leads to (6) and proves our main result.

6. Discussion

Theorem 1 and Corollary 1 are the first non-trivial results on the convergence rate of PT for sampling from binary RBMs. As shown in Corollary 2, they lead to an upper bound on the approximation error when applying the analyzed PT method to gradient estimation in RBM training. The upper bound depends exponentially on the size of one layer and the absolute values of the RBM parameters. The fewer the number of nodes and/or the smaller the magnitudes of the parameters, the faster the convergence. This intuitive result resembles the bounds on the approximation bias in contrastive divergence learning [7]. Actually, our current hypothesis is that RBM PT chains are in general not rapidly mixing, and that it is in general not possible to get rid of the exponential dependencies in the RBM sampling complexity. This is the case for some related models such as variants of the Potts model [16, 22].

Because our analysis considers the convergence to the stationary distribution of the product chain consisting of all replica chains, we get an undesired additional dependency on the number of replicas. However, such a dependency seems to be inevitable for this type of analysis, and we are not aware of any approach bounding the original chain.

The terms $f_j$, $j = 1,\dots,m$, and $g_i$, $i = 1,\dots,n$, are the fractions of the minimal and maximal probability to be activated of visible neuron $j$ and hidden neuron $i$ given states of the respective other layer. They are measures of how strongly neurons vary with changing input. The observed dependency is in accordance with the expectation that highly variable chains require more samples to reach the stationary distribution.

Furthermore, the minimal and maximal probabilities to be activated give information about how close the conditional distributions are to being uniform. Gibbs sampling relies on these conditional distributions, and it is optimal if all neurons are activated with probability 0.5 (it reaches the stationary distribution in only one sampling step in this case); it gets more and more deterministic the closer these probabilities get to zero or one, and thus mixing slows down, as our bound indicates.

When inspecting the role of the inverse temperatures in (6), we see that the term $\max_t(\beta_{t+1}-\beta_t)$ gets minimal if $\beta_0,\dots,\beta_N$ are spaced uniformly between 0 and 1. In this case $\max_t(\beta_{t+1}-\beta_t) = \frac1N$. This theoretical result is in accordance with the heuristic choice of a uniform inverse temperature spacing as used in many applications (e.g., see [9, 11]). So far, to our knowledge, the only other theoretically justified heuristic for spacing the inverse temperatures in the tempered transitions literature is to choose $\beta_0,\dots,\beta_N$ geometrically (i.e., $\beta_{i+1}/\beta_i = \mathrm{const}$) [23, 24]. Geometric temperature spacing has been shown to be optimal for sampling from Gaussian distributions by Neal [23], and this result has been extended to a more general class of distributions by Behrens et al. [24]. However, RBMs do not fall into this class, and a geometric spacing is sub-optimal for sampling RBMs [25]. Desjardins et al. suggest to use an adaptation scheme that dynamically adjusts the inverse temperatures during RBM training (starting from a uniform spacing). In their experiments, the higher inverse temperatures are adapted to values that are much closer together than a geometric spacing would suggest [25], which also supports a uniform rather than a geometric static inverse temperature spacing.
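To illustrate the spacing discussion (our sketch; the parameters are made up), the following compares $\max_t(\beta_{t+1}-\beta_t)$ for a uniform and a geometric spacing of the inverse temperatures:

```python
import numpy as np

N = 10
uniform = np.linspace(0.0, 1.0, N + 1)

# Geometric spacing beta_{i+1}/beta_i = const needs beta_0 > 0;
# we pick beta_1 = 0.01 and prepend beta_0 = 0 as in the RBM setting.
geometric = np.concatenate(([0.0], np.geomspace(0.01, 1.0, N)))

print(np.diff(uniform).max())    # 1/N = 0.1, the minimum possible value
print(np.diff(geometric).max())  # > 1/N, so the bound (6) is looser
```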

We regard the PT variant we analyzed as the typical algorithm considered in theoretical analyses of PT, for instance by Madras and Zheng [14] and Woodard et al. [12]. Still, it differs from the PT most often used for RBMs in practice. As mentioned in section 2.4, in this practical PT variant the corresponding transition operator is not reversible for several reasons. Firstly, it consists of an update step in which blockwise Gibbs sampling is performed in all tempered chains. This is not reversible due to the fixed order of first sampling $h^{k+1}$ based on $p(h \mid v^k)$, followed by sampling $v^{k+1}$ based on $p(v \mid h^{k+1})$, where the superscripts denote the sampling steps. Secondly, the swapping is usually organized in two substeps, where even temperatures are considered in the first and odd temperatures in the second substep. And, last but not least, we have $P = TQ$ instead of $P = QTQ$. Lacking reversibility prohibits our type of analysis, and we are not aware of other strategies not requiring the property. Of course, one can design PT operators that are closer to the one used for RBM sampling in practice but still reversible. For example, we can define $P = QTQ$, choose $T$ to perform an update of all visible or all hidden variables (each picked with probability 0.5) in all tempered chains simultaneously, and $Q$ to try to swap, with equal probability, either all even or all odd chains in parallel. We have adapted our analysis to this PT variant. However, the resulting bound was actually less tight (among other reasons because this $T$ is not defined as a product chain anymore, but is of the form $T = \prod_{t=0}^{N}T_t$, and thus Theorem 5 cannot be applied) and the proof did not provide additional insights.

7. Conclusion

Parallel tempering (PT) is state-of-the-art for sampling restricted Boltzmann machines (RBMs). We presented a first analysis of the convergence rate of the Markov chains of the PT algorithm as considered by Madras and Zheng [14] and Woodard et al. [12] for sampling RBMs by deriving an arguably loose, but non-trivial bound on the spectral gap. We find an exponential dependency on the maximum size of the two layers and the absolute values of the RBM parameters, in accordance with existing bounds on the approximation bias in contrastive divergence learning. We hypothesise that RBM PT chains are in general not rapidly mixing, and that one can in general not get rid of the exponential dependencies on the number of variables in the RBM sampling complexity. We regard proving conditions under which RBM PT chains are torpid mixing as an interesting question for future research.

Our lower bound on the spectral gap gets maximal if the inverse temperatures are spaced uniformly, which is a common choice in practice. We do not claim that this spacing is optimal; in particular, an adaptive spacing seems to be a good idea. Still, the result matches the practical experience that a uniform spacing (either fixed or as starting point for adaptation) is preferable to a geometric one when using PT for sampling RBMs.

The bound on the convergence rate naturally leads to a bound on the gradient approximation error of the method when used during RBM training, which resembles bounds on the approximation error of contrastive divergence learning [7].

References

[1] P. Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, in: D. E. Rumelhart, J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, MIT Press, 194-281, 1986.

[2] G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation 14 (2002) 1771-1800.

[3] A. Fischer, C. Igel, Training Restricted Boltzmann Machines: An Introduction, Pattern Recognition 47 (2014) 25-39.

[4] G. E. Hinton, R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 313 (5786) (2006) 504-507.

[5] A. Fischer, C. Igel, Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines, in: K. Diamantaras, W. Duch, L. S. Iliadis (Eds.), International Conference on Artificial Neural Networks (ICANN 2010), LNCS, Springer-Verlag, 208-217, 2010.

[6] Y. Bengio, O. Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation 21 (6) (2009) 1601-1621.

[7] A. Fischer, C. Igel, Bounding the Bias of Contrastive Divergence Learning, Neural Computation 23 (2011) 664-673.

[8] C. J. Geyer, Markov chain Monte Carlo maximum likelihood, in: E. M. Keramidas (Ed.), Proceedings of the 23rd Symposium on the Interface of Computing Science and Statistics, Interface Foundation of North America, 156-163, 1991.

[9] R. Salakhutdinov, Learning in Markov Random Fields using Tempered Transitions, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (NIPS 22), 1598-1606, 2009.

[10] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, O. Delalleau, Parallel Tempering for Training of Restricted Boltzmann Machines, Journal of Machine Learning Research Workshop and Conference Proceedings 9 (AISTATS 2010) (2010) 145-152.

[11] K. Cho, T. Raiko, A. Ilin, Parallel tempering is efficient for learning restricted Boltzmann machines, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2010), IEEE Press, 2010.

[12] D. B. Woodard, S. C. Schmidler, M. Huber, Conditions for Rapid Mixing of Parallel and Simulated Tempering on Multimodal Distributions, The Annals of Applied Probability 19 (2009) 617-640.

[13] P. Diaconis, L. Saloff-Coste, What do we know about the Metropolis algorithm?, Journal of Computer and System Sciences 57 (1998) 20-36.

[14] N. Madras, Z. Zheng, On the swapping algorithm, Random Structures & Algorithms 22 (2003) 66-97.

[15] K. Brügge, A. Fischer, C. Igel, The flip-the-state transition operator for Restricted Boltzmann Machines, Machine Learning 93 (2013) 53-69.

[16] N. Bhatnagar, D. Randall, Torpid Mixing of Simulated Tempering on the Potts Model, in: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004.

[17] S. Caracciolo, A. Pelissetto, A. D. Sokal, Two remarks on simulated tempering, unpublished manuscript.

[18] N. Madras, D. Randall, Markov chain decomposition for convergence rate analysis, The Annals of Applied Probability 12 (2002) 581-606.

[19] P. Diaconis, L. Saloff-Coste, Comparison theorems for reversible Markov chains, The Annals of Applied Probability 3 (1993) 696-730.

[20] P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Springer-Verlag, 1999.

[21] P. Diaconis, L. Saloff-Coste, Logarithmic Sobolev inequalities for finite Markov chains, The Annals of Applied Probability 6 (1996) 695-750.

[22] D. Woodard, S. Schmidler, M. Huber, Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering, Electronic Journal of Probability 14 (2009) 780-804.

[23] R. Neal, Sampling from multimodal distributions using tempered transitions, Statistics and Computing 6 (1996) 353-366.

[24] G. Behrens, N. Friel, M. Hurn, Tuning tempered transitions, Statistics and Computing 22 (1) (2012) 65-78.

[25] G. Desjardins, A. Courville, Y. Bengio, Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs, in: NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

Appendix A. Bounding the bias of an MCMC estimate in terms of the convergence rate

The convergence rate of a Markov chain can be used to bound the bias of an approximation of the expected value of a function under the model distribution based on samples from the chain. This is demonstrated in the following. Consider a Markov chain on the state space $\Omega$. Let $\pi$ be the stationary distribution of the chain and $P^k(z, \cdot)$ be the distribution obtained by running the Markov chain for $k$ steps starting from a fixed sample $z$. Furthermore, let $t : \Omega \to \mathbb{R}$ be an arbitrary function. Then, a bound on the bias of an estimate that approximates the expected value of $t(x)$ under $\pi$ based on samples from $P^k(z, \cdot)$ is given by

$$\Big|\sum_{x\in\Omega}\pi(x)\,t(x) - \sum_{x\in\Omega}P^k(z, x)\,t(x)\Big| = \Big|\sum_{x\in\Omega}\big(\pi(x) - P^k(z, x)\big)\,t(x)\Big| \le \sum_{x\in\Omega}\big|\pi(x) - P^k(z, x)\big|\,|t(x)|$$
$$\le \max_{x\in\Omega}|t(x)|\sum_{x\in\Omega}\big|\pi(x) - P^k(z, x)\big| = 2\max_{x\in\Omega}|t(x)|\,\|P^k(z,\cdot) - \pi\|,$$

where $\|P^k(z,\cdot) - \pi\|$ is the total variation distance, which can be bounded, for example, using equation (3).
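A quick numerical check of this bound (our sketch, on a made-up three-state chain):

```python
import numpy as np

P = 0.5 * np.eye(3) + 0.5 * (np.ones((3, 3)) - np.eye(3)) / 2.0
pi = np.full(3, 1.0 / 3.0)
t = np.array([1.0, -2.0, 0.5])            # arbitrary function t(x)

for k in (1, 3, 6):
    Pk = np.linalg.matrix_power(P, k)[0]  # k-step distribution from state 0
    bias = abs(pi @ t - Pk @ t)
    tv = 0.5 * np.abs(Pk - pi).sum()
    assert bias <= 2 * np.abs(t).max() * tv + 1e-12
    print(k, bias, 2 * np.abs(t).max() * tv)
```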

Appendix B. Relation between the convergence rates of the PT product chain and the original chain

In the following, we prove that bounding the convergence rate of the PT product chain also bounds the convergence rate of the original chain (i.e., the chain with inverse temperature $\beta_N = 1$). Assume that for the product chain it holds that

$$\sum_x \Big|P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big| < \varepsilon$$

for an arbitrary starting point $y$. Then we can write

$$\sum_{x_{[N]}}\sum_{x_{[-N]}}\Big|P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big| < \varepsilon,$$

which implies

$$\varepsilon > \sum_{x_{[N]}}\Big|\sum_{x_{[-N]}}\Big(P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big)\Big|.$$

Thus, $\varepsilon$ also upper bounds the variation distance for the original chain at inverse temperature $\beta_N = 1$:

$$\sum_{x_{[N]}}\Big|\sum_{x_{[-N]}}\Big(P^k(y, x) - \prod_{t=0}^{N}\pi_t(x_{[t]})\Big)\Big| = \sum_{x_{[N]}}\Big|P^k_N(y, x_{[N]}) - \pi_N(x_{[N]})\sum_{x_{[-N]}}\prod_{t=0}^{N-1}\pi_t(x_{[t]})\Big|$$
$$= \sum_{x_{[N]}}\big|P^k_N(y, x_{[N]}) - \pi_N(x_{[N]})\big| < \varepsilon,$$

where $P^k_N(y, x_{[N]}) = \sum_{x_{[-N]}}P^k(y, x)$ denotes the marginal distribution of the $N$-th component after $k$ steps of the PT chain.

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

The flip-the-state transition operator for restricted Boltzmann machines

The flip-the-state transition operator for restricted Boltzmann machines Mach Learn 2013) 93:53 69 DOI 10.1007/s10994-013-5390-3 The flip-the-state transition operator for restricted Boltzmann machines Kai Brügge Asja Fischer Christian Igel Received: 14 February 2013 / Accepted:

More information

Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines

Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines Asja Fischer and Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum,

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines KyungHyun Cho, Tapani Raiko, Alexander Ilin Abstract A new interest towards restricted Boltzmann machines (RBMs) has risen due

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines

Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann Machines KyungHyun Cho KYUNGHYUN.CHO@AALTO.FI Tapani Raiko TAPANI.RAIKO@AALTO.FI Alexander Ilin ALEXANDER.ILIN@AALTO.FI Department

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Learning Deep Boltzmann Machines using Adaptive MCMC

Learning Deep Boltzmann Machines using Adaptive MCMC Ruslan Salakhutdinov Brain and Cognitive Sciences and CSAIL, MIT 77 Massachusetts Avenue, Cambridge, MA 02139 rsalakhu@mit.edu Abstract When modeling high-dimensional richly structured data, it is often

More information

Restricted Boltzmann Machines

Restricted Boltzmann Machines Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time

More information

arxiv: v1 [cs.ne] 6 May 2014

arxiv: v1 [cs.ne] 6 May 2014 Training Restricted Boltzmann Machine by Perturbation arxiv:1405.1436v1 [cs.ne] 6 May 2014 Siamak Ravanbakhsh, Russell Greiner Department of Computing Science University of Alberta {mravanba,rgreiner@ualberta.ca}

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

Markov Networks.

Markov Networks. Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

More information

Contrastive Divergence

Contrastive Divergence Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Decomposition Methods and Sampling Circuits in the Cartesian Lattice

Decomposition Methods and Sampling Circuits in the Cartesian Lattice Decomposition Methods and Sampling Circuits in the Cartesian Lattice Dana Randall College of Computing and School of Mathematics Georgia Institute of Technology Atlanta, GA 30332-0280 randall@math.gatech.edu

More information

Mixture Models and Representational Power of RBM s, DBN s and DBM s

Mixture Models and Representational Power of RBM s, DBN s and DBM s Mixture Models and Representational Power of RBM s, DBN s and DBM s Guido Montufar Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany. montufar@mis.mpg.de Abstract

More information

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Density estimation. Computing, and avoiding, partition functions. Iain Murray

Density estimation. Computing, and avoiding, partition functions. Iain Murray Density estimation Computing, and avoiding, partition functions Roadmap: Motivation: density estimation Understanding annealing/tempering NADE Iain Murray School of Informatics, University of Edinburgh

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13 Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent

More information

Deep Generative Models. (Unsupervised Learning)

Deep Generative Models. (Unsupervised Learning) Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns

An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns An Empirical Investigation of Minimum Probability Flow Learning Under Different Connectivity Patterns Daniel Jiwoong Im, Ethan Buchman, and Graham W. Taylor School of Engineering University of Guelph Guelph,

More information

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis

Lect4: Exact Sampling Techniques and MCMC Convergence Analysis Lect4: Exact Sampling Techniques and MCMC Convergence Analysis. Exact sampling. Convergence analysis of MCMC. First-hit time analysis for MCMC--ways to analyze the proposals. Outline of the Module Definitions

More information

Kyle Reing University of Southern California April 18, 2018

Kyle Reing University of Southern California April 18, 2018 Renormalization Group and Information Theory Kyle Reing University of Southern California April 18, 2018 Overview Renormalization Group Overview Information Theoretic Preliminaries Real Space Mutual Information

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 11. Boltzmann Machines COMP9444 17s2 Boltzmann Machines 1 Outline Content Addressable Memory Hopfield Network Generative Models Boltzmann Machine Restricted Boltzmann

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines Desjardins, G., Courville, A., Bengio, Y., Vincent, P. and Delalleau, O. Technical Report 1345, Dept. IRO, Université de

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

arxiv: v1 [cs.lg] 8 Oct 2015

arxiv: v1 [cs.lg] 8 Oct 2015 Empirical Analysis of Sampling Based Estimators for Evaluating RBMs Vidyadhar Upadhya and P. S. Sastry arxiv:1510.02255v1 [cs.lg] 8 Oct 2015 Indian Institute of Science, Bangalore, India Abstract. The

More information

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that 1 More examples 1.1 Exponential families under conditioning Exponential families also behave nicely under conditioning. Specifically, suppose we write η = η 1, η 2 R k R p k so that dp η dm 0 = e ηt 1

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Machine Learning. Probabilistic KNN.

Machine Learning. Probabilistic KNN. Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007

More information

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann

More information

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness

More information

Learning and Evaluating Boltzmann Machines

Learning and Evaluating Boltzmann Machines Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Ruslan Salakhutdinov 2008. June 26, 2008

More information

ARestricted Boltzmann machine (RBM) [1] is a probabilistic

ARestricted Boltzmann machine (RBM) [1] is a probabilistic 1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Modeling Documents with a Deep Boltzmann Machine

Modeling Documents with a Deep Boltzmann Machine Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava, Ruslan Salakhutdinov & Geoffrey Hinton UAI 2013 Presented by Zhe Gan, Duke University November 14, 2014 1 / 15 Outline Replicated Softmax

More information

Annealing Between Distributions by Averaging Moments

Annealing Between Distributions by Averaging Moments Annealing Between Distributions by Averaging Moments Chris J. Maddison Dept. of Comp. Sci. University of Toronto Roger Grosse CSAIL MIT Ruslan Salakhutdinov University of Toronto Partition Functions We

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Unsupervised Learning

Unsupervised Learning CS 3750 Advanced Machine Learning hkc6@pitt.edu Unsupervised Learning Data: Just data, no labels Goal: Learn some underlying hidden structure of the data P(, ) P( ) Principle Component Analysis (Dimensionality

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Gaussian-Bernoulli Deep Boltzmann Machine

Gaussian-Bernoulli Deep Boltzmann Machine Gaussian-Bernoulli Deep Boltzmann Machine KyungHyun Cho, Tapani Raiko and Alexander Ilin Aalto University School of Science Department of Information and Computer Science Espoo, Finland {kyunghyun.cho,tapani.raiko,alexander.ilin}@aalto.fi

More information

Approximation properties of DBNs with binary hidden units and real-valued visible units

Approximation properties of DBNs with binary hidden units and real-valued visible units Approximation properties of DBNs with binary hidden units and real-valued visible units Oswin Krause Oswin.Krause@diku.dk Department of Computer Science, University of Copenhagen, 100 Copenhagen, Denmark

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang Autoencoder Autoencoder Neural networks trained to attempt to copy its input to its output Contain

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

First Hitting Time Analysis of the Independence Metropolis Sampler

First Hitting Time Analysis of the Independence Metropolis Sampler (To appear in the Journal for Theoretical Probability) First Hitting Time Analysis of the Independence Metropolis Sampler Romeo Maciuca and Song-Chun Zhu In this paper, we study a special case of the Metropolis

More information

Inductive Principles for Restricted Boltzmann Machine Learning

Inductive Principles for Restricted Boltzmann Machine Learning Inductive Principles for Restricted Boltzmann Machine Learning Benjamin Marlin Department of Computer Science University of British Columbia Joint work with Kevin Swersky, Bo Chen and Nando de Freitas

More information

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information