Learning Algorithm of Boltzmann Machine Based on Spatial Monte Carlo Integration Method


algorithms — Article

Learning Algorithm of Boltzmann Machine Based on Spatial Monte Carlo Integration Method

Muneki Yasuda
Graduate School of Science and Engineering, Yamagata University, Yamagata, Japan

Received: 28 February 2018; Accepted: 3 April 2018; Published: 4 April 2018

Abstract: The machine learning techniques for Markov random fields are fundamental in various fields involving pattern recognition, image processing, sparse modeling, and earth science, and a Boltzmann machine is one of the most important models in Markov random fields. However, the inference and learning problems in the Boltzmann machine are NP-hard. The investigation of an effective learning algorithm for the Boltzmann machine is one of the most important challenges in the field of statistical machine learning. In this paper, we study Boltzmann machine learning based on the (first-order) spatial Monte Carlo integration method, referred to as the 1-SMCI learning method, which was proposed in the author's previous paper. In the first part of this paper, we compare the method with the maximum pseudo-likelihood estimation (MPLE) method using a theoretical and a numerical approach, and show that the 1-SMCI learning method is more effective than the MPLE. In the latter part, we compare the 1-SMCI learning method with other effective methods, ratio matching and minimum probability flow, using a numerical experiment, and show that the 1-SMCI learning method outperforms them.

Keywords: machine learning; Boltzmann machine; Monte Carlo integration; approximate algorithm

1. Introduction

The machine learning techniques for Markov random fields (MRFs) are fundamental in various fields involving pattern recognition [1,2], image processing [3], sparse modeling [4], and Earth science [5,6], and a Boltzmann machine [7-9] is one of the most important models in MRFs. The inference and learning problems in the Boltzmann machine are NP-hard, because they include intractable multiple summations over all the possible configurations of variables. Thus, one of the major challenges of the Boltzmann machine is the design of the efficient inference and learning algorithms that it requires. Various effective algorithms for Boltzmann machine learning were proposed by many researchers, a few of which are mean-field learning algorithms [10-15], maximum pseudo-likelihood estimation (MPLE) [16,17], contrastive divergence (CD) [18], ratio matching (RM) [19], and minimum probability flow (MPF) [20,21]. In particular, the CD and MPLE methods are widely used. More recently, the author proposed an effective learning algorithm based on the spatial Monte Carlo integration (SMCI) method [22]. The SMCI method is a Monte Carlo integration method that takes spatial information around the region of focus into account; it was proven that this method is more effective than the standard Monte Carlo integration method. The main target of this study is Boltzmann machine learning based on the first-order SMCI (1-SMCI) method, which is the simplest version of the SMCI method. We refer to it as the 1-SMCI learning method in this paper. It was empirically shown through numerical experiments that Boltzmann machine learning based on the 1-SMCI learning method is more effective than MPLE in the case where no model error exists, i.e., in the case where the learning model includes the generative model [22].

However, the theoretical reason for this was not revealed at all. In this paper, theoretical insights into the effectiveness of the 1-SMCI learning method as compared to that of MPLE are given from an asymptotic point of view. The theoretical results obtained in this paper state that the gradients of the log-likelihood function obtained by the 1-SMCI learning method constitute a quantitatively better approximation of the exact gradients than those obtained by the MPLE method in the case where the generative model and the learning model are the same Boltzmann machine (Section 4.1). This is one of the contributions of this paper. In the previous paper [22], the 1-SMCI learning method was compared with only the MPLE. In this paper, we compare the 1-SMCI learning method with other effective learning algorithms, RM and MPF, through numerical experiments, and show that the 1-SMCI learning method is superior to them (Section 5). This is the second contribution of this paper.

The remainder of this paper is organized as follows. The definition of Boltzmann machine learning and a brief explanation of the MPLE method are given in Section 2. In Section 3, we explain Boltzmann machine learning based on the 1-SMCI method: reviews of the SMCI and 1-SMCI learning methods are presented in Sections 3.1 and 3.2, respectively. In Section 4, the 1-SMCI learning method and the MPLE are compared using two different approaches, a theoretical approach (Section 4.1) and a numerical approach (Section 4.2), and the effectiveness of the 1-SMCI learning method as compared to the MPLE is shown. In Section 5, we numerically compare the 1-SMCI method with other effective learning algorithms and observe that the 1-SMCI learning method yields the best approximation. Finally, the conclusion is given in Section 6.

2. Boltzmann Machine Learning

Consider an undirected and connected graph, $G = (V, E)$, with $n$ nodes, where $V := \{1, 2, \ldots, n\}$ is the set of labels of nodes and $E$ is the set of labels of undirected links; an undirected link between nodes $i$ and $j$ is labeled $(i, j)$. Since an undirected graph is now considered, labels $(i, j)$ and $(j, i)$ indicate the same link. On undirected graph $G$, we define a Boltzmann machine with random variables $x := \{x_i \in \mathcal{X} \mid i \in V\}$, where $\mathcal{X}$ is the sample space of the variables. It is expressed as [7,9]

$P_{\mathrm{BM}}(x \mid w) := \frac{1}{Z(w)} \exp\Big( \sum_{(i,j) \in E} w_{ij} x_i x_j \Big)$,  (1)

where $Z(w)$ is the partition function defined by

$Z(w) := \sum_{x} \exp\Big( \sum_{(i,j) \in E} w_{ij} x_i x_j \Big)$,

where $\sum_{x}$ is the multiple summation over all the possible realizations of $x$; i.e., $\sum_{x} = \prod_{i \in V} \sum_{x_i \in \mathcal{X}}$. Here and in the following, if $x_i$ is continuous, $\sum_{x_i \in \mathcal{X}}$ is replaced by integration. $w := \{w_{ij} \mid (i, j) \in E\}$ represents the symmetric coupling parameters ($w_{ij} = w_{ji}$). Although a Boltzmann machine can include a bias term, e.g., $\sum_{i \in V} b_i x_i$, in its exponent, it is ignored in this paper for the sake of the simplicity of arguments.

Suppose that a set of $N$ data points corresponding to $x$, $D := \{x^{(\mu)} \mid \mu = 1, 2, \ldots, N\}$, where $x^{(\mu)} := \{x_i^{(\mu)} \in \mathcal{X} \mid i \in V\}$, is obtained. The goal of Boltzmann machine learning is to maximize the log-likelihood

$l(w; D) := \frac{1}{N} \sum_{\mu=1}^{N} \ln P_{\mathrm{BM}}(x^{(\mu)} \mid w)$  (2)

with respect to $w$, that is, the maximum likelihood estimation (MLE). The Boltzmann machine, with the $w$ that maximizes Equation (2), yields the distribution most similar to the data distribution (also referred to as the empirical distribution) in the perspective of the measure based on the Kullback-Leibler divergence (KLD).
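For graphs small enough to enumerate every configuration, Equations (1) and (2) can be evaluated by brute force, which is a convenient way to see what the learning problem involves. The following Python sketch does this for a hypothetical 4-node cycle with illustrative couplings and data (none of it taken from the paper); it is only feasible for small $n$ because $Z(w)$ sums over $2^n$ terms.

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # a hypothetical 4-node cycle
w = {e: 0.3 for e in edges}                        # symmetric couplings w_ij

def exponent(x, w, edges):
    """Sum over links of w_ij * x_i * x_j for one configuration x."""
    return sum(w[(i, j)] * x[i] * x[j] for (i, j) in edges)

def partition_function(w, edges, n):
    """Z(w): summation over all 2^n configurations (small n only)."""
    return sum(math.exp(exponent(x, w, edges))
               for x in itertools.product([-1, +1], repeat=n))

def log_likelihood(data, w, edges, n):
    """Equation (2): average of ln P_BM(x^(mu) | w) over the data set D."""
    log_z = math.log(partition_function(w, edges, n))
    return sum(exponent(x, w, edges) - log_z for x in data) / len(data)

data = [(1, 1, 1, 1), (1, 1, -1, -1), (-1, -1, -1, -1)]   # toy data points
print(log_likelihood(data, w, edges, n=4))
```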

This fact can be easily seen in the following. The empirical distribution of $D$ is expressed as

$Q_D(x) := \frac{1}{N} \sum_{\mu=1}^{N} \prod_{i \in V} \delta(x_i, x_i^{(\mu)})$,  (3)

where $\delta(a, b)$ is the Kronecker delta function: $\delta(a, b) = 1$ when $a = b$, and $\delta(a, b) = 0$ when $a \neq b$. The KLD between the empirical distribution and the Boltzmann machine in Equation (1),

$D_{\mathrm{KL}}[Q_D \,\|\, P_{\mathrm{BM}}] := \sum_{x} Q_D(x) \ln \frac{Q_D(x)}{P_{\mathrm{BM}}(x \mid w)}$,  (4)

can be rewritten as $D_{\mathrm{KL}}[Q_D \,\|\, P_{\mathrm{BM}}] = -l(w; D) + C$, where $C$ is a constant unrelated to $w$. From this equation, we see that the $w$ that maximizes the log-likelihood in Equation (2) minimizes the KLD. Since the log-likelihood in Equation (2) is a concave function with respect to $w$ [8], in principle, we can optimize the log-likelihood using a gradient ascent method. The gradient of the log-likelihood with respect to $w_{ij}$ is

$\Delta_{ij}^{\mathrm{MLE}}(w; D) := \frac{\partial l(w; D)}{\partial w_{ij}} = \frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} - E_{\mathrm{BM}}[x_i x_j \mid w]$,  (5)

where $E_{\mathrm{BM}}[\cdots \mid w] := \sum_{x} (\cdots) P_{\mathrm{BM}}(x \mid w)$ is the expectation of the assigned quantity over the Boltzmann machine in Equation (1). At the optimal point of the MLE, all the gradients are zero, and therefore, from Equation (5), the optimal $w$ is the solution to the simultaneous equations

$\frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} = E_{\mathrm{BM}}[x_i x_j \mid w]$.  (6)

When the data points are generated independently from a Boltzmann machine, $P_{\mathrm{BM}}(x \mid w_{\mathrm{gen}})$, defined on the same graph as the Boltzmann machine we use in the learning, i.e., the case without model error, the solution to the MLE, $w_{\mathrm{MLE}}$, converges to $w_{\mathrm{gen}}$ as $N \to \infty$ [23]. In other words, the MLE is asymptotically consistent. However, it is difficult to compute the second term in Equation (5), because the computation of these expectations needs a summation over $O(2^n)$ terms. Thus, exact Boltzmann machine learning, i.e., the MLE, cannot be performed. As mentioned in Section 1, various approximations for Boltzmann machine learning were proposed by many authors, such as the mean-field learning methods [10-15] and the MPLE [16,17], CD [18], RM [19], MPF [20,21] and SMCI [22] methods. In the following, we briefly review the MPLE method.

In the MPLE, we maximize the following pseudo-likelihood [16,17,24] instead of the true log-likelihood in Equation (2):

$l_{\mathrm{MPLE}}(w; D) := \frac{1}{N} \sum_{\mu=1}^{N} \sum_{i \in V} \ln P_{\mathrm{BM}}\big(x_i^{(\mu)} \mid x_{\overline{\{i\}}}^{(\mu)}, w\big)$,  (7)

where, for $A \subseteq V$, $x_A := \{x_i \mid i \in A\}$ denotes the variables in $A$ and $\overline{A} := V \setminus A$; i.e., $x_{\overline{\{i\}}} = x \setminus \{x_i\}$. The conditional distribution in the above equation is the conditional distribution in the Boltzmann machine, expressed by

$P_{\mathrm{BM}}(x_i \mid x_{\overline{\{i\}}}, w) = \frac{P_{\mathrm{BM}}(x \mid w)}{\sum_{x_i \in \mathcal{X}} P_{\mathrm{BM}}(x \mid w)} = \frac{\exp\big( U_i(x_{\partial(i)}, w)\, x_i \big)}{\sum_{x_i \in \mathcal{X}} \exp\big( U_i(x_{\partial(i)}, w)\, x_i \big)}$,  (8)

where

$U_i(x_{\partial(i)}, w) := \sum_{j \in \partial(i)} w_{ij} x_j$,  (9)

and where $\partial(i) \subseteq V$ is the set of labels of nodes directly connected to node $i$; i.e., $\partial(i) := \{j \mid (i, j) \in E\}$.
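For $\mathcal{X} = \{-1, +1\}$, the conditional distribution in Equation (8) reduces to $\exp(U_i x_i)/(2\cosh U_i)$, so the pseudo-likelihood of Equation (7) has a simple closed form. The sketch below assumes that binary case; the graph, couplings, and data are hypothetical illustrations.

```python
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # hypothetical 4-node cycle
w = {e: 0.1 for e in edges}
neighbors = {i: [] for e in edges for i in e}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def U(i, x, w, neighbors):
    """Equation (9): the local field U_i, the sum of w_ij x_j over the neighbors of i."""
    return sum(w[(min(i, j), max(i, j))] * x[j] for j in neighbors[i])

def pseudo_log_likelihood(data, w, neighbors):
    """Equation (7): ln P_BM(x_i | rest) = U_i x_i - ln(2 cosh U_i), summed over nodes."""
    total = 0.0
    for x in data:
        for i in neighbors:
            u = U(i, x, w, neighbors)
            total += u * x[i] - math.log(2.0 * math.cosh(u))
    return total / len(data)

data = [(1, 1, 1, 1), (1, -1, -1, 1), (-1, -1, -1, -1)]   # toy data points
print(pseudo_log_likelihood(data, w, neighbors))
```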

The derivative of the pseudo-likelihood with respect to $w_{ij}$ is

$\frac{\partial l_{\mathrm{MPLE}}(w; D)}{\partial w_{ij}} = 2\Big( \frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} - m_{ij}^{\mathrm{MPLE}}(w; D) \Big)$,  (10)

where $m_{ij}^{\mathrm{MPLE}}(w; D)$ is defined by

$m_{ij}^{\mathrm{MPLE}}(w; D) := \frac{1}{2N} \sum_{\mu=1}^{N} \big\{ x_j^{(\mu)} M_i(x_{\partial(i)}^{(\mu)}, w) + x_i^{(\mu)} M_j(x_{\partial(j)}^{(\mu)}, w) \big\}$,  (11)

where

$M_i(x_{\partial(i)}, w) := \frac{\sum_{x_i \in \mathcal{X}} x_i \exp\big( U_i(x_{\partial(i)}, w)\, x_i \big)}{\sum_{x_i \in \mathcal{X}} \exp\big( U_i(x_{\partial(i)}, w)\, x_i \big)}$  (12)

and where, for a set $A \subseteq V$, $x_A^{(\mu)}$ is the $\mu$-th data point corresponding to $x_A$; i.e., $x_A^{(\mu)} = \{x_i^{(\mu)} \mid i \in A\}$. When $\mathcal{X} = \{-1, +1\}$, $M_i(x_{\partial(i)}, w) = \tanh U_i(x_{\partial(i)}, w)$. In order to fit the magnitude of the gradient to that of the MLE, we use half of Equation (10) as the gradient of the MPLE:

$\Delta_{ij}^{\mathrm{MPLE}}(w; D) := \frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} - m_{ij}^{\mathrm{MPLE}}(w; D)$.  (13)

The order of the total computational complexity of the gradients in Equation (11) is $O(N|E|)$, where $|E|$ is the number of links in $G(V, E)$. The pseudo-likelihood is also a concave function with respect to $w$, and therefore, one can optimize it using a gradient ascent method. The typical performance of the MPLE method is almost the same as, or slightly better than, that of the CD method in Boltzmann machine learning [24]. From Equation (13), the optimal $w$ in the MPLE is the solution to the simultaneous equations

$\frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} = m_{ij}^{\mathrm{MPLE}}(w; D)$.  (14)

By comparing Equation (6) with Equation (14), it can be seen that the MPLE is the approximation of the MLE such that $E_{\mathrm{BM}}[x_i x_j \mid w] \approx m_{ij}^{\mathrm{MPLE}}(w; D)$. Many authors proved that the MPLE is also asymptotically consistent (for example, [24-26]); that is, in the case without model error, the solution to the MPLE, $w_{\mathrm{MPLE}}$, converges to $w_{\mathrm{gen}}$ as $N \to \infty$. However, the asymptotic variance of the MPLE is larger than that of the MLE [25].
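A minimal sketch of the MPLE gradient of Equations (11)-(13) for the binary case $\mathcal{X} = \{-1, +1\}$, where $M_i = \tanh U_i$. The graph, couplings, and data are hypothetical illustrations; one evaluation of all gradients costs $O(N|E|)$, as stated above.

```python
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # hypothetical 4-node cycle
w = {e: 0.1 for e in edges}
neighbors = {i: [] for e in edges for i in e}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def U(i, x):
    """Equation (9) evaluated at a data point x."""
    return sum(w[(min(i, k), max(i, k))] * x[k] for k in neighbors[i])

def mple_gradient(data):
    """Equation (13): data correlation minus m_ij^MPLE of Equation (11)."""
    n_data = len(data)
    grad = {}
    for (i, j) in edges:
        data_corr = sum(x[i] * x[j] for x in data) / n_data
        m = sum(x[j] * math.tanh(U(i, x)) + x[i] * math.tanh(U(j, x))
                for x in data) / (2 * n_data)
        grad[(i, j)] = data_corr - m
    return grad

data = [(1, 1, 1, 1), (1, -1, -1, 1), (-1, -1, -1, -1)]   # toy data points
print(mple_gradient(data))
```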

3. Boltzmann Machine Learning Based on Spatial Monte Carlo Integration Method

In this section, we present reviews of both the SMCI method and the application of the first order of the SMCI method to Boltzmann machine learning, i.e., the 1-SMCI learning method.

3.1. Spatial Monte Carlo Integration Method

Assume that we have a set of i.i.d. sample points, $S := \{s^{(\mu)} \mid \mu = 1, 2, \ldots, N\}$, where $s^{(\mu)} := \{s_i^{(\mu)} \in \mathcal{X} \mid i \in V\}$, drawn from a Boltzmann machine, $P_{\mathrm{BM}}(x \mid w)$, by using a Markov chain Monte Carlo (MCMC) method. Suppose that we want to know the expectation of a function $f(x_C)$, $C \subseteq V$, over the Boltzmann machine, $E_{\mathrm{BM}}[f(x_C) \mid w]$. In the standard Monte Carlo integration (MCI) method, we approximate the desired expectation by the simple average over the given sample points $S$:

$E_{\mathrm{BM}}[f(x_C) \mid w] \approx \sum_{x} f(x_C)\, Q_S(x) = \frac{1}{N} \sum_{\mu=1}^{N} f(s_C^{(\mu)})$,  (15)

where $Q_S(x)$ is the distribution of the sample points, which is defined in the same manner as Equation (3), and where, for a set $A \subseteq V$, $s_A^{(\mu)}$ is the $\mu$-th sample point corresponding to $x_A$; i.e., $s_A^{(\mu)} = \{s_i^{(\mu)} \mid i \in A\}$.

The SMCI method considers spatial information around $x_C$, in contrast to the standard MCI method. For the SMCI method, we define the neighboring regions of the target region, $C \subseteq V$, as follows. The first-nearest-neighbor region, $\mathcal{N}_1(C)$, is defined by

$\mathcal{N}_1(C) := \{j \mid (i, j) \in E,\ i \in C,\ j \notin C\}$.  (16)

Therefore, when $C = \{i\}$, $\mathcal{N}_1(C) = \partial(i)$. Similarly, the second-nearest-neighbor region, $\mathcal{N}_2(C)$, is defined by

$\mathcal{N}_2(C) := \{j \mid (i, j) \in E,\ i \in \mathcal{N}_1(C),\ j \notin C,\ j \notin \mathcal{N}_1(C)\}$.  (17)

In a similar manner, for $k \geq 1$, we define the $k$-th-nearest-neighbor region, $\mathcal{N}_k(C)$, by

$\mathcal{N}_k(C) := \{j \mid (i, j) \in E,\ i \in \mathcal{N}_{k-1}(C),\ j \notin R_{k-1}(C)\}$,  (18)

where $R_k(C) := \bigcup_{m=0}^{k} \mathcal{N}_m(C)$ and $\mathcal{N}_0(C) := C$. An example of the neighboring regions in a square-grid graph is shown in Figure 1.

Figure 1. Example of the neighboring regions: (a) when $C = \{13\}$, $\mathcal{N}_1(C) = \{8, 12, 14, 18\}$, $\mathcal{N}_2(C) = \{3, 7, 9, 11, 15, 17, 19, 23\}$, and $R_2(C)$ is the union of $C$, $\mathcal{N}_1(C)$, and $\mathcal{N}_2(C)$; and (b) when $C = \{12, 13\}$, $\mathcal{N}_1(C) = \{7, 8, 11, 14, 17, 18\}$.
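The neighboring regions of Equations (16)-(18) can be computed from the edge list by a level-by-level expansion, since $\mathcal{N}_k(C)$ is exactly the $k$-th level of a breadth-first search started from $C$. The sketch below assumes the $5 \times 5$ square grid with nodes numbered 1-25 row by row that the caption of Figure 1 suggests; with that assumption it reproduces the sets listed there for $C = \{13\}$.

```python
def grid_edges(rows, cols):
    """Undirected links of a rows x cols square grid; nodes numbered from 1, row by row."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c + 1
            if c + 1 < cols:
                edges.add((i, i + 1))       # link to the right neighbor
            if r + 1 < rows:
                edges.add((i, i + cols))    # link to the neighbor below
    return edges

def neighbor_region(C, edges, k):
    """N_k(C) of Equation (18), with N_0(C) := C and R_k(C) the union of N_0(C), ..., N_k(C)."""
    covered = set(C)      # R_m(C), grown level by level
    frontier = set(C)     # N_m(C) for the current level m
    for _ in range(k):
        nxt = set()
        for (i, j) in edges:
            if i in frontier and j not in covered:
                nxt.add(j)
            if j in frontier and i not in covered:
                nxt.add(i)
        covered |= nxt
        frontier = nxt
    return frontier

edges = grid_edges(5, 5)
print(sorted(neighbor_region({13}, edges, 1)))   # expected: [8, 12, 14, 18]
print(sorted(neighbor_region({13}, edges, 2)))   # expected: [3, 7, 9, 11, 15, 17, 19, 23]
```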

By using the conditional distribution,

$P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid x_{\mathcal{N}_k(C)}, w) = \frac{P_{\mathrm{BM}}(x \mid w)}{\sum_{x_{R_{k-1}(C)}} P_{\mathrm{BM}}(x \mid w)}$,  (19)

and the marginal distribution,

$P_{\mathrm{BM}}(x_{\mathcal{N}_k(C)} \mid w) = \sum_{x_{V \setminus \mathcal{N}_k(C)}} P_{\mathrm{BM}}(x \mid w)$,  (20)

the desired expectation can be expressed as

$E_{\mathrm{BM}}[f(x_C) \mid w] = \sum_{x_{R_{k-1}(C)}} \sum_{x_{\mathcal{N}_k(C)}} f(x_C)\, P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid x_{\mathcal{N}_k(C)}, w)\, P_{\mathrm{BM}}(x_{\mathcal{N}_k(C)} \mid w)$,  (21)

where, for a set $A \subseteq V$, $\sum_{x_A} = \prod_{i \in A} \sum_{x_i \in \mathcal{X}}$. In Equation (19), we used the Markov property of the Boltzmann machine: $P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid x_{V \setminus R_{k-1}(C)}, w) = P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid x_{\mathcal{N}_k(C)}, w)$. In the $k$-th-order SMCI ($k$-SMCI) method, $E_{\mathrm{BM}}[f(x_C) \mid w]$ in Equation (21) is approximated by

$E_k[f(x_C) \mid w, S] := \sum_{x_{R_{k-1}(C)}} \sum_{x_{\mathcal{N}_k(C)}} f(x_C)\, P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid x_{\mathcal{N}_k(C)}, w)\, Q_S(x) = \frac{1}{N} \sum_{\mu=1}^{N} \sum_{x_{R_{k-1}(C)}} f(x_C)\, P_{\mathrm{BM}}(x_{R_{k-1}(C)} \mid s_{\mathcal{N}_k(C)}^{(\mu)}, w)$.  (22)

The $k$-SMCI method takes the spatial information up to the $(k-1)$-th-nearest-neighbor region into account, and it approximates the outside of it (namely, the $k$-th-nearest-neighbor region) by the sample distribution. For the SMCI method, two important facts were theoretically proven [22]: (i) the SMCI method is asymptotically better than the standard MCI method, and (ii) a higher-order SMCI method is asymptotically better than a lower-order one.

3.2. Boltzmann Machine Learning Based on First-Order SMCI Method

Applying the 1-SMCI method to Boltzmann machine learning is achieved by approximating the intractable expectations, $E_{\mathrm{BM}}[x_i x_j \mid w]$, by the 1-SMCI method in Equation (22) with $k = 1$. Although Equation (22) requires sample points $S$ drawn from $P_{\mathrm{BM}}(x \mid w)$, as discussed in the previous section, we can avoid the sampling by using the dataset $D$ instead of $S$ [22]. We approximate $E_{\mathrm{BM}}[x_i x_j \mid w]$ by

$m_{ij}^{\mathrm{1SMCI}}(w; D) := E_1[x_i x_j \mid w, D] = \frac{1}{N} \sum_{\mu=1}^{N} \sum_{x_i, x_j \in \mathcal{X}} x_i x_j\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}^{(\mu)}, w\big)$.  (23)

Since

$P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}^{(\mu)}, w\big) \propto \exp\big\{ \big( U_i(x_{\partial(i)}^{(\mu)}, w) - w_{ij} x_j^{(\mu)} \big) x_i + \big( U_j(x_{\partial(j)}^{(\mu)}, w) - w_{ij} x_i^{(\mu)} \big) x_j + w_{ij} x_i x_j \big\}$,  (24)

the order of the computational complexity of $m_{ij}^{\mathrm{1SMCI}}(w; D)$ is the same as that of $m_{ij}^{\mathrm{MPLE}}(w; D)$ with respect to $n$. For example, when $\mathcal{X} = \{-1, +1\}$, Equation (23) becomes

$m_{ij}^{\mathrm{1SMCI}}(w; D) = \frac{1}{N} \sum_{\mu=1}^{N} \tanh\Big[ \tanh^{-1}\big\{ \tanh\big( U_i(x_{\partial(i)}^{(\mu)}, w) - w_{ij} x_j^{(\mu)} \big) \tanh\big( U_j(x_{\partial(j)}^{(\mu)}, w) - w_{ij} x_i^{(\mu)} \big) \big\} + w_{ij} \Big]$,  (25)

where $\tanh^{-1}(x)$ is the inverse function of $\tanh(x)$. By using the 1-SMCI learning method, the true gradient, $\Delta_{ij}^{\mathrm{MLE}}(w; D)$, is thus approximated as

$\Delta_{ij}^{\mathrm{1SMCI}}(w; D) := \frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} - m_{ij}^{\mathrm{1SMCI}}(w; D)$,  (26)

and therefore, the optimal $w$ in this approximation is the solution to the simultaneous equations

$\frac{1}{N} \sum_{\mu=1}^{N} x_i^{(\mu)} x_j^{(\mu)} = m_{ij}^{\mathrm{1SMCI}}(w; D)$.  (27)

The order of the total computational complexity of the gradients in Equation (23) is $O(N|E|)$, which is the same as that of the MPLE. The solution to Equation (27) is obtained by a gradient ascent method with the gradients in Equation (26).
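A sketch of the 1-SMCI approximation for the binary case: Equation (25) gives $m_{ij}^{\mathrm{1SMCI}}(w; D)$ in closed form from the cavity fields $U_i - w_{ij} x_j^{(\mu)}$ and $U_j - w_{ij} x_i^{(\mu)}$, and Equation (26) gives the resulting gradient. The graph, couplings, and data are hypothetical illustrations.

```python
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # hypothetical 4-node cycle
w = {e: 0.1 for e in edges}
neighbors = {i: [] for e in edges for i in e}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def U(i, x):
    """Equation (9) evaluated at a data point x."""
    return sum(w[(min(i, k), max(i, k))] * x[k] for k in neighbors[i])

def m_1smci(i, j, data):
    """Equation (25): tanh(arctanh(tanh(a) tanh(b)) + w_ij), averaged over the data."""
    w_ij = w[(min(i, j), max(i, j))]
    total = 0.0
    for x in data:
        a = U(i, x) - w_ij * x[j]   # cavity field on node i, with the direct w_ij x_j term removed
        b = U(j, x) - w_ij * x[i]   # cavity field on node j, with the direct w_ij x_i term removed
        total += math.tanh(math.atanh(math.tanh(a) * math.tanh(b)) + w_ij)
    return total / len(data)

def smci_gradient(data):
    """Equation (26): data correlation minus m_ij^1SMCI."""
    return {(i, j): sum(x[i] * x[j] for x in data) / len(data) - m_1smci(i, j, data)
            for (i, j) in edges}

data = [(1, 1, 1, 1), (1, -1, -1, 1), (-1, -1, -1, -1)]   # toy data points
print(smci_gradient(data))
```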

4. Comparison of 1-SMCI Learning Method and MPLE

It was empirically observed in some numerical experiments that the 1-SMCI learning method discussed in the previous section is better than the MPLE in the case without model error [22]. In this section, we first provide some theoretical insights into this observation, and then give some numerical comparisons of the two methods in the cases with and without model error.

4.1. Comparison from Asymptotic Point of View

Suppose that we want to approximate the expectation $E_{\mathrm{BM}}[x_i x_j \mid w]$ in a Boltzmann machine, and assume that the $N$ data points are generated independently from the same Boltzmann machine. Here, we re-express $m_{ij}^{\mathrm{MPLE}}(w; D)$ in Equation (11) and $m_{ij}^{\mathrm{1SMCI}}(w; D)$ in Equation (23) as

$m_{ij}^{\mathrm{MPLE}}(w; D) = \frac{1}{N} \sum_{\mu=1}^{N} \rho_{ij}^{\mathrm{MPLE}}(x^{(\mu)}, w)$,  (28)

$m_{ij}^{\mathrm{1SMCI}}(w; D) = \frac{1}{N} \sum_{\mu=1}^{N} \rho_{ij}^{\mathrm{1SMCI}}(x^{(\mu)}, w)$,  (29)

respectively, where

$\rho_{ij}^{\mathrm{MPLE}}(x, w) := \frac{1}{2} \big\{ x_j M_i(x_{\partial(i)}, w) + x_i M_j(x_{\partial(j)}, w) \big\}$,

$\rho_{ij}^{\mathrm{1SMCI}}(x, w) := \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} x_i x_j\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big)$.

Since the $x^{(\mu)}$ are i.i.d. points sampled from $P_{\mathrm{BM}}(x \mid w)$, the $\rho_{ij}^{\mathrm{MPLE}}(x^{(\mu)}, w)$ and $\rho_{ij}^{\mathrm{1SMCI}}(x^{(\mu)}, w)$ can also be regarded as i.i.d. sample points. Thus, $m_{ij}^{\mathrm{MPLE}}(w; D)$ and $m_{ij}^{\mathrm{1SMCI}}(w; D)$ are sample averages over i.i.d. points. One can easily verify that the two equations $\sum_{x} P_{\mathrm{BM}}(x \mid w)\, \rho_{ij}^{\mathrm{MPLE}}(x, w) = E_{\mathrm{BM}}[x_i x_j \mid w]$ and $\sum_{x} P_{\mathrm{BM}}(x \mid w)\, \rho_{ij}^{\mathrm{1SMCI}}(x, w) = E_{\mathrm{BM}}[x_i x_j \mid w]$ are justified (the former equation can also be justified by using the correlation equality [27]). Therefore, from the law of large numbers, $m_{ij}^{\mathrm{MPLE}}(w; D) = m_{ij}^{\mathrm{1SMCI}}(w; D) = E_{\mathrm{BM}}[x_i x_j \mid w]$ in the limit of $N \to \infty$. This implies that, in the case without model error, the 1-SMCI learning method has the same solution as the MLE in the limit of $N \to \infty$. From the central limit theorem, the distributions of $m_{ij}^{\mathrm{MPLE}}(w; D)$ and $m_{ij}^{\mathrm{1SMCI}}(w; D)$ asymptotically converge to Gaussians with mean $E_{\mathrm{BM}}[x_i x_j \mid w]$ and variances

$v_{ij}^{\mathrm{MPLE}}(w) := \frac{1}{N}\Big( \sum_{x} \rho_{ij}^{\mathrm{MPLE}}(x, w)^2\, P_{\mathrm{BM}}(x \mid w) - E_{\mathrm{BM}}[x_i x_j \mid w]^2 \Big)$,  (30)

$v_{ij}^{\mathrm{1SMCI}}(w) := \frac{1}{N}\Big( \sum_{x} \rho_{ij}^{\mathrm{1SMCI}}(x, w)^2\, P_{\mathrm{BM}}(x \mid w) - E_{\mathrm{BM}}[x_i x_j \mid w]^2 \Big)$,  (31)

respectively, for $N \gg 1$. For the two variances, we obtain the following theorem.

Theorem 1. For a Boltzmann machine, $P_{\mathrm{BM}}(x \mid w)$, defined in Equation (1), the inequality $v_{ij}^{\mathrm{MPLE}}(w) \geq v_{ij}^{\mathrm{1SMCI}}(w)$ is satisfied for all $(i, j) \in E$ and for any $N$.

The proof of this theorem is given in Appendix A.
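Because Theorem 1 concerns exact expectations over $P_{\mathrm{BM}}(x \mid w)$, it can be checked directly on a Boltzmann machine small enough to enumerate. The sketch below computes the bracketed parts of Equations (30) and (31) (i.e., $N v_{ij}$) by brute force and verifies $v_{ij}^{\mathrm{MPLE}} \geq v_{ij}^{\mathrm{1SMCI}}$ for every link of a hypothetical 4-node cycle with illustrative couplings.

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # hypothetical 4-node cycle
w = {e: 0.25 for e in edges}
n = 4
neighbors = {i: [] for e in edges for i in e}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def U(i, x):
    return sum(w[(min(i, k), max(i, k))] * x[k] for k in neighbors[i])

# Exact P_BM(x | w) over all 2^n configurations.
configs = list(itertools.product([-1, +1], repeat=n))
weights = [math.exp(sum(w[e] * x[e[0]] * x[e[1]] for e in edges)) for x in configs]
Z = sum(weights)
probs = [wt / Z for wt in weights]

def rho_mple(i, j, x):
    """rho_ij^MPLE of Section 4.1 (binary case: M_i = tanh U_i)."""
    return 0.5 * (x[j] * math.tanh(U(i, x)) + x[i] * math.tanh(U(j, x)))

def rho_1smci(i, j, x):
    """rho_ij^1SMCI of Section 4.1, via the closed form of Equation (25)."""
    w_ij = w[(min(i, j), max(i, j))]
    a, b = U(i, x) - w_ij * x[j], U(j, x) - w_ij * x[i]
    return math.tanh(math.atanh(math.tanh(a) * math.tanh(b)) + w_ij)

for (i, j) in edges:
    mean = sum(p * x[i] * x[j] for x, p in zip(configs, probs))
    v_mple = sum(p * rho_mple(i, j, x) ** 2 for x, p in zip(configs, probs)) - mean ** 2
    v_1smci = sum(p * rho_1smci(i, j, x) ** 2 for x, p in zip(configs, probs)) - mean ** 2
    print((i, j), v_mple >= v_1smci)   # Theorem 1 predicts True for every link
```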

Theorem 1 states that the variance of the distribution of $m_{ij}^{\mathrm{MPLE}}(w; D)$ is always larger than (or equal to) that of $m_{ij}^{\mathrm{1SMCI}}(w; D)$. This means that, when $N \gg 1$, the distribution of $m_{ij}^{\mathrm{1SMCI}}(w; D)$ converges to a Gaussian around the mean value (i.e., the exact expectation) that is sharper than the Gaussian to which the distribution of $m_{ij}^{\mathrm{MPLE}}(w; D)$ converges; therefore, it is likely that $m_{ij}^{\mathrm{1SMCI}}(w; D)$ is closer to the exact expectation than $m_{ij}^{\mathrm{MPLE}}(w; D)$ when $N \gg 1$, that is, $m_{ij}^{\mathrm{1SMCI}}(w; D)$ is a better approximation of $E_{\mathrm{BM}}[x_i x_j \mid w]$ than $m_{ij}^{\mathrm{MPLE}}(w; D)$.

Next, we consider the differences between the true gradient in Equation (5) and the approximate gradients in Equations (13) and (26) for $w_{ij}$:

$e_{ij}^{\mathrm{MPLE}}(w; D) := \Delta_{ij}^{\mathrm{MPLE}}(w; D) - \Delta_{ij}^{\mathrm{MLE}}(w; D) = E_{\mathrm{BM}}[x_i x_j \mid w] - m_{ij}^{\mathrm{MPLE}}(w; D)$,  (32)

$e_{ij}^{\mathrm{1SMCI}}(w; D) := \Delta_{ij}^{\mathrm{1SMCI}}(w; D) - \Delta_{ij}^{\mathrm{MLE}}(w; D) = E_{\mathrm{BM}}[x_i x_j \mid w] - m_{ij}^{\mathrm{1SMCI}}(w; D)$.  (33)

For the gradient differences in Equations (32) and (33), we obtain the following theorem.

Theorem 2. For a Boltzmann machine, $P_{\mathrm{BM}}(x \mid w)$, defined in Equation (1), the inequality $P\big( |e_{ij}^{\mathrm{MPLE}}(w; D)| \leq \varepsilon \big) \leq P\big( |e_{ij}^{\mathrm{1SMCI}}(w; D)| \leq \varepsilon \big)$, $\forall \varepsilon > 0$, is satisfied for all $(i, j) \in E$ when $N \to \infty$, where $D$ is the set of data points generated independently from $P_{\mathrm{BM}}(x \mid w)$.

The proof of this theorem is given in Appendix B. Theorem 2 states that it is likely that the magnitude of $e_{ij}^{\mathrm{1SMCI}}(w; D)$ is smaller than (or equivalent to) that of $e_{ij}^{\mathrm{MPLE}}(w; D)$ when the data points are generated independently from the same Boltzmann machine and when $N \gg 1$.

Suppose that data points are generated independently from a generative Boltzmann machine, $P_{\mathrm{BM}}(x \mid w_{\mathrm{gen}})$, defined on $G_{\mathrm{gen}}$, and that a learning Boltzmann machine, defined on the same graph as the generative Boltzmann machine, is trained using the generated data points. In this case, since there is no model error, the solutions of the MLE, the MPLE, and the 1-SMCI learning methods are expected to approach $w_{\mathrm{gen}}$ as $N \to \infty$; that is, $\Delta_{ij}^{\mathrm{MLE}}(w_{\mathrm{gen}}; D)$, $\Delta_{ij}^{\mathrm{MPLE}}(w_{\mathrm{gen}}; D)$, and $\Delta_{ij}^{\mathrm{1SMCI}}(w_{\mathrm{gen}}; D)$ are expected to approach zero as $N \to \infty$. Consider the case in which $N$ is very large and $\Delta_{ij}^{\mathrm{MLE}}(w_{\mathrm{gen}}; D) = 0$. From the statement in Theorem 2, $\Delta_{ij}^{\mathrm{1SMCI}}(w_{\mathrm{gen}}; D)$ is statistically closer to zero than $\Delta_{ij}^{\mathrm{MPLE}}(w_{\mathrm{gen}}; D)$. This implies that the solution of the 1-SMCI learning method converges to that of the MLE faster than that of the MPLE.

The theoretical results presented in this section have not reached a rigid justification of the effectiveness of the 1-SMCI learning method, because some issues still remain, for instance: (i) since we do not specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not, we cannot remove the possibility of the existence of local optimal solutions which degrade the performance of the 1-SMCI learning method; (ii) although we discussed the asymptotic property of $\Delta_{ij}^{\mathrm{1SMCI}}(w; D)$ for each link separately, a joint analysis of them is necessary for a more rigid discussion; and (iii) a perturbative analysis around the optimal point is completely lacking. However, we can expect that they constitute evidence that is important for gaining insight into the effectiveness.

4.2. Numerical Comparison

We generated $N$ data points from a generative Boltzmann machine, $P_{\mathrm{BM}}(x \mid w_{\mathrm{gen}})$, and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine, $w_{\mathrm{gen}}$, were generated from a uniform distribution, $U[-\lambda, \lambda]$.
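The data-generation step just described can be sketched as follows: couplings of a 4 x 4 square-grid generative Boltzmann machine are drawn from $U[-\lambda, \lambda]$ and $N$ data points are produced by single-site Gibbs sampling. The burn-in and thinning values below are assumptions for illustration, not settings reported in the paper.

```python
import math
import random

random.seed(0)
rows = cols = 4
lam = 0.3                                          # lambda of U[-lambda, lambda]
nodes = list(range(rows * cols))
edges = []
for r in range(rows):
    for c in range(cols):
        i = r * cols + c
        if c + 1 < cols:
            edges.append((i, i + 1))
        if r + 1 < rows:
            edges.append((i, i + cols))
w_gen = {e: random.uniform(-lam, lam) for e in edges}
neighbors = {i: [] for i in nodes}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def gibbs_sample(w, n_samples, burn_in=1000, thin=10):
    """Single-site Gibbs sampler for P_BM(x | w) with X = {-1, +1}."""
    x = [random.choice([-1, 1]) for _ in nodes]
    samples = []
    for sweep in range(burn_in + n_samples * thin):
        for i in nodes:
            field = sum(w[(min(i, j), max(i, j))] * x[j] for j in neighbors[i])
            # P(x_i = +1 | rest) = 1 / (1 + exp(-2 * field))
            x[i] = 1 if random.random() < 1.0 / (1.0 + math.exp(-2.0 * field)) else -1
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            samples.append(tuple(x))
    return samples

data = gibbs_sample(w_gen, n_samples=200)
print(len(data), data[0])
```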

First, we consider the case where the graph structures of the generative Boltzmann machine and the learning Boltzmann machine are the same: a 4 x 4 square grid, that is, the case without model error. Figure 2a shows the mean absolute errors between the solution to the MLE and the solutions to the approximate methods (the MPLE and the 1-SMCI learning method), $\sum_{(i,j) \in E} |w_{ij}^{\mathrm{MLE}} - w_{ij}^{\mathrm{approx}}| / |E|$, against $N$. Here, we set $\lambda = 0.3$. Since the size of the Boltzmann machine used here is not large, we can obtain the solution to the MLE. We observe that the solutions to the two approximate methods converge to the solution to the MLE as $N$ increases, and that the 1-SMCI learning method is better than the MPLE as an approximation of the MLE. These results are consistent with the results obtained in [22] and the theoretical arguments in the previous section.

Next, we consider the case in which the graph structure of the generative Boltzmann machine is fully connected with $n = 16$ and the learning Boltzmann machine is again a 4 x 4 square grid, namely the case with model error. Thus, this case is completely outside the theoretical arguments in the previous section. Figure 2b shows the mean absolute errors between the solution to the MLE and those to the approximate methods against $N$. Here, we set $\lambda = 0.2$. Unlike the above case, the solutions to the two approximate methods do not converge to the solution to the MLE because of the model error. The 1-SMCI learning method is again better than the MPLE as an approximation of the MLE in this case.

By comparing Figure 2a,b, we observe that the 1-SMCI learning method in (b) is worse than in (a). The following reason can be considered. In Section 3.2, we replaced $S$, which is the set of sample points drawn from the Boltzmann machine, by $D$ in order to avoid the sampling cost. However, this replacement implies the assumption of the case without model error, and therefore, it is not justified in the case with model error.

Figure 2. The mean absolute errors (MAEs) for various $N$: (a) the case without the model error and (b) the case with the model error. Each plot shows the average over 200 trials. MPLE, maximum pseudo-likelihood estimation; 1-SMCI, first-order spatial Monte Carlo integration method.

5. Numerical Comparison with Other Methods

In this section, we demonstrate a numerical comparison of the 1-SMCI learning method with other approximation methods, RM [19] and MPF [20,21]. The orders of the computational complexity of these two methods are the same as those of the MPLE and 1-SMCI learning methods. The four methods were implemented by a simple gradient ascent method, $w_{ij}^{(t+1)} \leftarrow w_{ij}^{(t)} + \eta \Delta_{ij}$, where $\eta > 0$ is the learning rate. As described in Section 4.2, we generated $N$ data points from a generative Boltzmann machine, $P_{\mathrm{BM}}(x \mid w_{\mathrm{gen}})$, and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine, $w_{\mathrm{gen}}$, were generated from $U[-0.3, 0.3]$. The graph structures of the generative Boltzmann machine and of the learning Boltzmann machine are the same: a 4 x 4 square grid.

Figure 3 shows the learning curves of the four methods. The horizontal axis represents the number of steps, $t$, of the gradient ascent method, and the vertical axis represents the mean absolute error between the solution to the MLE, $w^{\mathrm{MLE}}$, and the values of the coupling parameters at that step, $w^{(t)}$. In this experiment, we set $\eta = 0.2$, and the values of $w$ were initialized as zero.
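The simple gradient ascent described above can be sketched as follows for the 1-SMCI gradient of Equation (26); the learning rate $\eta = 0.2$ and the zero initialization follow the text, while the toy graph, data, and stopping rule are hypothetical. In the experiments, the same loop is run with the MPLE, RM, and MPF gradients for comparison.

```python
import math

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]           # hypothetical 4-node cycle
neighbors = {i: [] for e in edges for i in e}
for (i, j) in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)
data = [(1, 1, 1, 1), (1, 1, -1, -1), (-1, -1, -1, -1), (1, -1, -1, 1)]   # toy data

def smci_gradient(w):
    """Equation (26), with m_ij^1SMCI evaluated via Equation (25)."""
    grad = {}
    for (i, j) in edges:
        m = 0.0
        for x in data:
            a = sum(w[(min(i, k), max(i, k))] * x[k] for k in neighbors[i]) - w[(i, j)] * x[j]
            b = sum(w[(min(j, k), max(j, k))] * x[k] for k in neighbors[j]) - w[(i, j)] * x[i]
            m += math.tanh(math.atanh(math.tanh(a) * math.tanh(b)) + w[(i, j)])
        grad[(i, j)] = sum(x[i] * x[j] for x in data) / len(data) - m / len(data)
    return grad

eta = 0.2                                          # learning rate used in the paper
w = {e: 0.0 for e in edges}                        # zero initialization, as in the paper
for t in range(1000):
    grad = smci_gradient(w)
    for e in edges:
        w[e] += eta * grad[e]                      # w^(t+1) = w^(t) + eta * Delta
    if max(abs(g) for g in grad.values()) < 1e-6:  # simple convergence check
        break
print(t, w)
```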

Figure 3. Mean absolute errors (MAEs) versus the number of updates of the gradient ascent method: (a) $N = 200$ and (b) $N = $ [...]. Each plot shows the average over 200 trials. RM, ratio matching.

Since the vertical axes in Figure 3 represent the mean absolute error from the solution to the MLE, the lower curve is the better approximation of the MLE. We can observe that the MPF shows the fastest convergence and that the MPLE, RM, and MPF converge to almost the same values, while the 1-SMCI learning method converges to the lowest values. From this, we conclude that, among the four methods, the 1-SMCI learning method is the best approximation of the MLE. However, the 1-SMCI learning method has a drawback. The MPLE, RM, and MPF are convex optimization problems and they have unique solutions, whereas we have not specified whether the 1-SMCI learning method is a convex problem or not at the present stage. We cannot eliminate the possibility of the existence of local optimal solutions that degrade the accuracy of the approximation.

As mentioned above, the orders of the computational complexity of these four methods, the MPLE, RM, MPF, and 1-SMCI learning methods, are the same, $O(N|E|)$. However, it is important to check the real computational times of these methods. Table 1 shows the total computational times needed for one-time learning (until convergence), where the setting of the experiment is the same as that of Figure 3b.

Table 1. Real computational times of the four learning methods. The setting of the experiment is the same as that of Figure 3b.
time (s): MPLE [...], RM [...], MPF [...], 1-SMCI [...]

The MPF method is the fastest, and the 1-SMCI learning method is the slowest, being about 6-7 times slower than the MPF method.

6. Conclusions

In this paper, we examined the effectiveness of Boltzmann machine learning based on the 1-SMCI method proposed in [22], where it was shown through numerical experiments that the 1-SMCI learning method is more effective than the MPLE in the case where no model error exists. In Section 4.1, we gave theoretical results for this empirical observation from the asymptotic point of view. The theoretical results improved our understanding of the advantage of the 1-SMCI learning method as compared to the MPLE. The numerical experiments in Section 4.2 showed that the 1-SMCI learning method is a better approximation of the MLE than the MPLE in the cases with and without model error. Furthermore, we compared the 1-SMCI learning method with the other effective methods, RM and MPF, using the numerical experiments in Section 5. The numerical results showed that the 1-SMCI learning method is the best of the four methods.

However, issues related to the 1-SMCI learning method still remain. Since an objective function of the 1-SMCI learning method (the counterpart of, e.g., Equation (7) for the MPLE) has not been revealed, it is not straightforward to specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not. This is one of the most challenging issues of the method. As shown in Section 4.2, the performance of the 1-SMCI learning method decreases when model error exists, i.e., when the learning Boltzmann machine does not include the generative model. The decrease may be caused by the replacement of the sample points, $S$, by the data points, $D$, as discussed in the same section. It is expected that combining the 1-SMCI learning method with an effective sampling method, e.g., persistent contrastive divergence [28], relaxes the problem of the performance degradation. The presented 1-SMCI learning method can be applied to other types of Boltzmann machines, e.g., the restricted Boltzmann machine [1] and the deep Boltzmann machine [2,29]. Although we focused on Boltzmann machine learning in this paper, the SMCI method can be applied to various MRFs [22]. Hence, there are many future directions for application of the SMCI method: for example, the graphical LASSO problem [4], Bayesian image processing [3], Earth science [5,6] and brain-computer interfaces [30-33].

Acknowledgments: This work was partially supported by JST CREST, Grant Number JPMJCR1402, JSPS KAKENHI, Grant Numbers 15K00330 and 15H03699, and MIC SCOPE (Strategic Information and Communications R&D Promotion Programme), Grant Number [...].

Conflicts of Interest: The author declares no conflict of interest.

Appendix A. Proof of Theorem 1

The first term in Equation (30) can be rewritten as:

$\sum_{x} \rho_{ij}^{\mathrm{MPLE}}(x, w)^2 P_{\mathrm{BM}}(x \mid w) = \sum_{x} \Big( \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} \rho_{ij}^{\mathrm{MPLE}}(x, w)^2\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) \Big) P_{\mathrm{BM}}(x \mid w)$.  (A1)

Since, for $(i, j) \in E$, $x_{\mathcal{N}_1(\{i,j\})} = (x_{\partial(i)} \cup x_{\partial(j)}) \setminus \{x_i, x_j\}$, we obtain the two expressions:

$P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) = P_{\mathrm{BM}}\big(x_i \mid x_j, x_{\mathcal{N}_1(\{i,j\})}, w\big)\, P_{\mathrm{BM}}\big(x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) = P_{\mathrm{BM}}(x_i \mid x_{\partial(i)}, w)\, P_{\mathrm{BM}}\big(x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big)$  (A2)

and the expression, obtained by alternating $i$ and $j$,

$P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) = P_{\mathrm{BM}}(x_j \mid x_{\partial(j)}, w)\, P_{\mathrm{BM}}\big(x_i \mid x_{\mathcal{N}_1(\{i,j\})}, w\big)$.  (A3)

From Equations (A2) and (A3), we obtain:

$\rho_{ij}^{\mathrm{1SMCI}}(x, w) = \frac{1}{2} \Big( \sum_{x_j \in \mathcal{X}} x_j M_i(x_{\partial(i)}, w)\, P_{\mathrm{BM}}\big(x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) + \sum_{x_i \in \mathcal{X}} x_i M_j(x_{\partial(j)}, w)\, P_{\mathrm{BM}}\big(x_i \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) \Big) = \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} \rho_{ij}^{\mathrm{MPLE}}(x, w)\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big)$,  (A4)

where we use the relation $M_i(x_{\partial(i)}, w) = \sum_{x_i \in \mathcal{X}} x_i P_{\mathrm{BM}}\big(x_i \mid x_{\mathcal{N}_1(\{i\})}, w\big)$. From this equation, we obtain

$\sum_{x} \rho_{ij}^{\mathrm{1SMCI}}(x, w)^2 P_{\mathrm{BM}}(x \mid w) = \sum_{x} \Big( \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} \rho_{ij}^{\mathrm{MPLE}}(x, w)\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) \Big)^2 P_{\mathrm{BM}}(x \mid w)$.  (A5)

Finally, from Equations (A1) and (A5), the inequality

$v_{ij}^{\mathrm{MPLE}}(w) - v_{ij}^{\mathrm{1SMCI}}(w) = \frac{1}{N} \sum_{x} \Big\{ \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} \rho_{ij}^{\mathrm{MPLE}}(x, w)^2\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) - \Big( \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} \rho_{ij}^{\mathrm{MPLE}}(x, w)\, P_{\mathrm{BM}}\big(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w\big) \Big)^2 \Big\} P_{\mathrm{BM}}(x \mid w) \geq 0$  (A6)

is obtained; the term in the braces is non-negative because it is the variance of $\rho_{ij}^{\mathrm{MPLE}}(x, w)$ with respect to the conditional distribution $P_{\mathrm{BM}}(x_i, x_j \mid x_{\mathcal{N}_1(\{i,j\})}, w)$.

Appendix B. Proof of Theorem 2

As mentioned in Section 4.1, from the central limit theorem, the distribution of $m_{ij}^{\mathrm{MPLE}}(w; D)$ converges to the Gaussian with mean $E_{\mathrm{BM}}[x_i x_j \mid w]$ and variance $v_{ij}^{\mathrm{MPLE}}(w)$ for $N \to \infty$. Therefore, the distribution of $e_{ij}^{\mathrm{MPLE}}(w; D)$ converges to the Gaussian with mean zero and variance $v_{ij}^{\mathrm{MPLE}}(w)$. This leads to

$P\big( |e_{ij}^{\mathrm{MPLE}}(w; D)| \leq \varepsilon \big) = P\big( -\varepsilon \leq e_{ij}^{\mathrm{MPLE}}(w; D) \leq \varepsilon \big) \to \int_{-\varepsilon}^{\varepsilon} \mathcal{N}\big(t \mid v_{ij}^{\mathrm{MPLE}}(w)\big)\, dt \quad (N \to \infty)$,  (A7)

where $\mathcal{N}(t \mid \sigma^2) := \exp\{-t^2 / (2\sigma^2)\} / \sqrt{2\pi\sigma^2}$. Equation (A7) is expressed as

$\int_{-\varepsilon}^{\varepsilon} \mathcal{N}\big(t \mid v_{ij}^{\mathrm{MPLE}}(w)\big)\, dt = \mathrm{erf}\Big( \frac{\varepsilon}{\sqrt{2 v_{ij}^{\mathrm{MPLE}}(w)}} \Big)$  (A8)

by using the error function

$\mathrm{erf}(x) := \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}\, dt$.  (A9)

We obtain

$P\big( |e_{ij}^{\mathrm{1SMCI}}(w; D)| \leq \varepsilon \big) \to \mathrm{erf}\Big( \frac{\varepsilon}{\sqrt{2 v_{ij}^{\mathrm{1SMCI}}(w)}} \Big)$  (A10)

by using the same derivation as for Equation (A8). Because the error function in Equation (A9) is a monotonically increasing function, from the statement in Theorem 1, i.e., $v_{ij}^{\mathrm{MPLE}}(w) \geq v_{ij}^{\mathrm{1SMCI}}(w)$, we obtain

$\mathrm{erf}\Big( \frac{\varepsilon}{\sqrt{2 v_{ij}^{\mathrm{MPLE}}(w)}} \Big) \leq \mathrm{erf}\Big( \frac{\varepsilon}{\sqrt{2 v_{ij}^{\mathrm{1SMCI}}(w)}} \Big)$.  (A11)

This inequality leads to the statement of Theorem 2.

References

1. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18.
2. Salakhutdinov, R.; Hinton, G.E. An Efficient Learning Procedure for Deep Boltzmann Machines. Neural Comput. 2012, 24.
3. Blake, A.; Kohli, P.; Rother, C. Markov Random Fields for Vision and Image Processing; The MIT Press: Cambridge, MA, USA.
4. Rish, I.; Grabarnik, G. Sparse Modeling: Theory, Algorithms, and Applications; CRC Press: Boca Raton, FL, USA.
5. Kuwatani, T.; Nagata, K.; Okada, M.; Toriumi, M. Markov random field modeling for mapping geofluid distributions from seismic velocity structures. Earth Planets Space 2014, 66.
6. Kuwatani, T.; Nagata, K.; Okada, M.; Toriumi, M. Markov-random-field modeling for linear seismic tomography. Phys. Rev. E 2014, 90.
7. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9.
8. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2008, 1.
9. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA.
10. Kappen, H.J.; Rodríguez, F.B. Efficient Learning in Boltzmann Machines Using Linear Response Theory. Neural Comput. 1998, 10.

11. Tanaka, T. Mean-field theory of Boltzmann machine learning. Phys. Rev. E 1998, 58.
12. Yasuda, M.; Tanaka, K. Approximate Learning Algorithm in Boltzmann Machines. Neural Comput. 2009, 21.
13. Sessak, V.; Monasson, R. Small-correlation expansions for the inverse Ising problem. J. Phys. A Math. Theor. 2009, 42.
14. Furtlehner, C. Approximate inverse Ising models close to a Bethe reference point. J. Stat. Mech. Theory Exp. 2013, 2013.
15. Roudi, Y.; Aurell, E.; Hertz, J. Statistical physics of pairwise probability models. Front. Comput. Neurosci. 2009, 3.
16. Besag, J. Statistical Analysis of Non-lattice Data. J. R. Stat. Soc. D 1975, 24.
17. Aurell, E.; Ekeberg, M. Inverse Ising Inference Using All the Data. Phys. Rev. Lett. 2012, 108.
18. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14.
19. Hyvärinen, A. Estimation of non-normalized statistical models using score matching. J. Mach. Learn. Res. 2005, 6.
20. Sohl-Dickstein, J.; Battaglino, P.B.; DeWeese, M.R. Minimum Probability Flow Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), Bellevue, WA, USA, 28 June-2 July 2011.
21. Sohl-Dickstein, J.; Battaglino, P.B.; DeWeese, M.R. New Method for Parameter Estimation in Probabilistic Models: Minimum Probability Flow. Phys. Rev. Lett. 2011, 107.
22. Yasuda, M. Monte Carlo Integration Using Spatial Structure of Markov Random Field. J. Phys. Soc. Jpn. 2015, 84.
23. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer: Berlin, Germany.
24. Hyvärinen, A. Consistency of Pseudolikelihood Estimation of Fully Visible Boltzmann Machines. Neural Comput. 2006, 18.
25. Lindsay, B.G. Composite Likelihood Methods. Contemp. Math. 1988, 80.
26. Jensen, J.L.; Møller, J. Pseudolikelihood for Exponential Family Models of Spatial Point Processes. Ann. Appl. Probab. 1991, 1.
27. Suzuki, M. Generalized Exact Formula for the Correlations of the Ising Model and Other Classical Systems. Phys. Lett. 1965, 19.
28. Tieleman, T. Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5-9 July 2008.
29. Salakhutdinov, R.; Hinton, G.E. Deep Boltzmann Machines. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS 2009), Clearwater Beach, FL, USA, April 2009.
30. Wang, H.; Zhang, Y.; Waytowich, N.R.; Krusienski, D.J.; Zhou, G.; Jin, J.; Wang, X.; Cichocki, A. Discriminative Feature Extraction via Multivariate Linear Regression for SSVEP-Based BCI. IEEE Trans. Neural Syst. Rehabil. Eng. 2016, 24.
31. Zhang, Y.; Zhou, G.; Jin, J.; Zhao, Q.; Wang, X.; Cichocki, A. Sparse Bayesian Classification of EEG for Brain-Computer Interface. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27.
32. Zhang, Y.; Wang, Y.; Zhou, G.; Jin, J.; Wang, B.; Wang, X.; Cichocki, A. Multi-kernel extreme learning machine for EEG classification in brain-computer interfaces. Expert Syst. Appl. 2018, 96.
33. Jiao, Y.; Zhang, Y.; Wang, Y.; Wang, B.; Jin, J.; Wang, X. A Novel Multilayer Correlation Maximization Model for Improving CCA-Based Frequency Recognition in SSVEP Brain-Computer Interface. Int. J. Neural Syst. 2018, 28.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.


More information

NON-CENTRAL 7-POINT FORMULA IN THE METHOD OF LINES FOR PARABOLIC AND BURGERS' EQUATIONS

NON-CENTRAL 7-POINT FORMULA IN THE METHOD OF LINES FOR PARABOLIC AND BURGERS' EQUATIONS IJRRAS 8 (3 September 011 www.arpapress.com/volumes/vol8issue3/ijrras_8_3_08.pdf NON-CENTRAL 7-POINT FORMULA IN THE METHOD OF LINES FOR PARABOLIC AND BURGERS' EQUATIONS H.O. Bakodah Dept. of Mathematc

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal

More information

Appendix B. The Finite Difference Scheme

Appendix B. The Finite Difference Scheme 140 APPENDIXES Appendx B. The Fnte Dfference Scheme In ths appendx we present numercal technques whch are used to approxmate solutons of system 3.1 3.3. A comprehensve treatment of theoretcal and mplementaton

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

The Second Anti-Mathima on Game Theory

The Second Anti-Mathima on Game Theory The Second Ant-Mathma on Game Theory Ath. Kehagas December 1 2006 1 Introducton In ths note we wll examne the noton of game equlbrum for three types of games 1. 2-player 2-acton zero-sum games 2. 2-player

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Quantifying Uncertainty

Quantifying Uncertainty Partcle Flters Quantfyng Uncertanty Sa Ravela M. I. T Last Updated: Sprng 2013 1 Quantfyng Uncertanty Partcle Flters Partcle Flters Appled to Sequental flterng problems Can also be appled to smoothng problems

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com

More information