arxiv: v3 [cs.ne] 18 Jan 2019


Boltzmann machines for time-series

Takayuki Osogami
IBM Research - Tokyo
osogami@jp.ibm.com

Abstract

We review Boltzmann machines extended for time-series. These models often have recurrent structure, and back propagation through time (BPTT) is used to learn their parameters. The per-step computational complexity of BPTT in online learning, however, grows linearly with respect to the length of the preceding time-series (i.e., the learning rule is not local in time), which limits the applicability of BPTT in online learning. We then review dynamic Boltzmann machines (DyBMs), whose learning rule is local in time. The DyBM's learning rule relates to spike-timing dependent plasticity (STDP), which has been postulated and experimentally confirmed for biological neural networks.

1 Introduction

The Boltzmann machine is a stochastic model for representing probability distributions over binary patterns [28]. In this paper, we review Boltzmann machines that have been studied as stochastic (generative) models of time-series. Such Boltzmann machines define probability distributions over time-series of binary patterns. They can also be modified to deal with time-series of real-valued patterns, similar to Boltzmann machines modified for real-valued patterns (e.g., Gaussian Boltzmann machines; see Section 6.3 from [28]). We will follow the probabilistic representations of [28] for intuitive interpretations in terms of probabilities.

In Section 3, we start with a Conditional Restricted Boltzmann Machine (CRBM) [40], which is a conditional Boltzmann machine (Section 4 from [28]) that gives the conditional probability of the next pattern given a fixed number of preceding patterns. A limitation of a CRBM is that it can take into account only the dependency within a fixed horizon, and as we increase the length of this horizon, the complexity of learning grows accordingly. To overcome this limitation of CRBMs, researchers have proposed Boltzmann machines having recurrent structures, which we review in Section 4.
These include spiking Boltzmann machines [12], temporal restricted Boltzmann machines (TRBMs) [37], recurrent temporal restricted Boltzmann machines (RTRBMs) [38], and extensions of those models. A standard approach to learning those models having recurrent structures is back propagation through time (BPTT). However, BPTT is undesirable when we learn time-series in an online manner, where we update the parameters of a model every time a new pattern arrives. Such online learning is needed when we want to quickly adapt to a changing environment or when we do not have sufficient memory to store the time-series. Unfortunately, the per-step computational complexity of BPTT in online learning grows linearly with respect to the length of the preceding time-series. This computational complexity limits the applicability of BPTT to online learning.

In Section 5, we review the dynamic Boltzmann machine (DyBM) [32, 31] and its extensions. The DyBM's per-step computational complexity in online learning is independent of the length of preceding

time-series. We discuss how the learning rule of the DyBM relates to spike-timing dependent plasticity (STDP), which has been postulated and experimentally confirmed for biological neural networks.

This survey paper is based on a personal note prepared for the third of the four parts of a tutorial given at the 26th International Joint Conference on Artificial Intelligence (IJCAI-17) held in Melbourne, Australia on August 21, 2017. See the tutorial webpage (footnote 1) for information about the tutorial. A survey corresponding to the first part of the tutorial (Boltzmann machines and energy-based models) can be found in [28]. We follow the definitions and notations used in [28].

2 Learning energy-based models for time-series

Consider a possibly multi-dimensional time-series:

$$x \equiv (x^{[t]})_{t=0}^{T}, \tag{1}$$

where $x^{[t]}$ denotes the binary pattern (vector) at time $t$. We will use $x^{[s,t]}$ to denote the time-series of the patterns from time $s$ to $t$. A goal of learning time-series is to maximize the log-likelihood of a given time-series $x$ (or a collection of multiple time-series) with respect to the distribution $P_\theta(\cdot)$ defined by a model under consideration, where we use $\theta$ to denote the set of the parameters of the model:

$$f(\theta) \equiv \log P_\theta(x) = \sum_{t=0}^{T} \log P_\theta(x^{[t]} \mid x^{[0,t-1]}), \tag{2}$$

where $P_\theta(x^{[t]} \mid x^{[0,t-1]})$ denotes the conditional probability that the pattern at time $t$ is $x^{[t]}$ given that the patterns up to time $t-1$ are $x^{[0,t-1]}$. Here, $P_\theta(x^{[0]} \mid x^{[0,-1]})$ denotes the probability that the pattern at time 0 is $x^{[0]}$, where $x^{[0,-1]}$ should be interpreted as an empty history.

We study models where the probability is represented with energy $E_\theta(\cdot)$ as follows:

$$P_\theta(x) = \sum_{\tilde h} P_\theta(x, \tilde h), \tag{3}$$

where

$$P_\theta(x, h) = \frac{\exp(-E_\theta(x, h))}{\sum_{\tilde x} \sum_{\tilde h} \exp(-E_\theta(\tilde x, \tilde h))}, \tag{4}$$

the summation with respect to $\tilde x$ is over all of the possible binary time-series of length $T$, and the summation with respect to $\tilde h$ is over all of the possible hidden values.
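As a minimal sketch of (3) and (4), the following brute-force enumeration computes the joint and marginal probabilities for a tiny model. The function names, the toy energy, and the sizes are illustrative assumptions, not part of the reviewed models; the vector `x` stands for a whole (flattened) binary time-series, and enumeration is feasible only for a handful of units.

```python
import itertools

import numpy as np


def joint_probabilities(energy, n_visible, n_hidden):
    """Return {(x, h): P(x, h)} over all binary patterns, following (4)."""
    states = [
        (x, h)
        for x in itertools.product([0, 1], repeat=n_visible)
        for h in itertools.product([0, 1], repeat=n_hidden)
    ]
    unnormalized = np.array(
        [np.exp(-energy(np.array(x), np.array(h))) for x, h in states]
    )
    probabilities = unnormalized / unnormalized.sum()
    return dict(zip(states, probabilities))


def marginal_probability(joint, x):
    """Return P(x) as the sum of P(x, h) over all hidden values h, following (3)."""
    return sum(p for (xs, _), p in joint.items() if xs == tuple(x))


# Hypothetical toy energy with a single hidden-visible weight matrix.
rng = np.random.default_rng(0)
W_toy = rng.normal(size=(2, 3))  # shape: (hidden units, visible units)


def toy_energy(x, h):
    return -h @ W_toy @ x


joint = joint_probabilities(toy_energy, n_visible=3, n_hidden=2)
p_x = marginal_probability(joint, [1, 0, 1])
```

The normalization in `joint_probabilities` is the double sum in the denominator of (4); its cost is exponential in the number of units, which is why the models reviewed later rely on structure or approximation rather than enumeration.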
The gradient of $f(\theta)$ can then be represented as follows (see (73)):

$$\nabla f(\theta) = -E_{\mathrm{target}}\big[E_\theta[\nabla E_\theta(X, H) \mid X]\big] + E_\theta[\nabla E_\theta(X, H)], \tag{5}$$

where $X$ represents the random time-series, $H$ represents the random hidden values, $E_\theta$ denotes the expectation with respect to the model distribution $P_\theta$, and $E_{\mathrm{target}}$ denotes the expectation with respect to the target distribution, which in our case is the empirical distribution of time-series. When a single time-series $x$ is given as the target, (5) is reduced to

$$\nabla f(\theta) = -E_\theta[\nabla E_\theta(x, H)] + E_\theta[\nabla E_\theta(X, H)]. \tag{6}$$

Footnote 1: group.php?id=7834

In other words, $f(\theta)$ can be maximized by maximizing the sum of

$$f_t(\theta) \equiv \log P_\theta(x^{[t]} \mid x^{[0,t-1]}), \tag{7}$$

where

$$P_\theta(x^{[t]} \mid x^{[0,t-1]}) = \sum_{\tilde h} P_\theta(x^{[t]}, \tilde h \mid x^{[0,t-1]}), \tag{8}$$

$$P_\theta(x^{[t]}, h \mid x^{[0,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h \mid x^{[0,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h} \exp(-E_\theta(\tilde x^{[t]}, \tilde h \mid x^{[0,t-1]}))}, \tag{9}$$

and $E_\theta(x^{[t]}, h \mid x^{[0,t-1]})$ is the conditional energy of $(x^{[t]}, h)$ given $x^{[0,t-1]}$. The gradient of $f_t(\theta)$ is given analogously to that of $f(\theta)$:

$$\nabla f_t(\theta) = -E_{\mathrm{target}}\big[E_\theta[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]}) \mid X^{[t]}]\big] + E_\theta[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]})]. \tag{10}$$

When the target is a single time-series $x$, we have

$$\nabla f_t(\theta) = -E_\theta[\nabla E_\theta(x^{[t]}, H \mid x^{[0,t-1]})] + E_\theta[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]})]. \tag{11}$$

3 Non-recurrent Boltzmann machines for time-series

By (2), any model that can represent the conditional probability $P_\theta(x^{[t]} \mid x^{[0,t-1]})$ can be used for time-series. In this section, we start with a Boltzmann machine that can be used to model a $D$-th order Markov model for an arbitrarily determined $D$. In $D$-th order Markov models, the conditional probability can be represented as

$$P_\theta(x^{[t]} \mid x^{[0,t-1]}) = P_\theta(x^{[t]} \mid x^{[t-D,t-1]}). \tag{12}$$

3.1 Conditional restricted Boltzmann machines

Figure 1a shows a particularly structured Boltzmann machine called Conditional Restricted Boltzmann Machine (CRBM) [40]. A CRBM represents the conditional probability on the right-hand side of (12). In the figure, we set $D = 2$.

The CRBM consists of $D + 1$ layers of visible units and a layer of hidden units. The units within each layer have no connections, but units between different layers may be connected to each other. Each visible layer corresponds to a pattern at a time $s \in [t-D, t]$. The CRBM is a conditional Boltzmann machine shown in Figure 2c from [28] but with a particular structure to represent time-series. The visible layers corresponding to $x^{[t-D,t-1]}$ are the input, and the visible layer corresponding to $x^{[t]}$ is the output. The parameters $\theta$ of the CRBM are independent of $t$.

More formally, the energy of a CRBM is given by

$$E_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}) = -(b^{V})^\top x^{[t]} - (b^{H})^\top h - h^\top W^{HV} x^{[t]} - \sum_{d=1}^{D} (x^{[t-d]})^\top W^{[d]} x^{[t]}, \tag{13}$$

where $x^{[t]}$ is output, $x^{[t-D,t-1]}$ is input, and $h$ is hidden. We can then represent the conditional probability as follows (see (14) from [28]):

$$P_\theta(x^{[t]} \mid x^{[t-D,t-1]}) = \sum_{\tilde h} P_\theta(x^{[t]}, \tilde h \mid x^{[t-D,t-1]}), \tag{14}$$

Figure 1: Conditional restricted Boltzmann machines. (a) Single hidden layer. (b) Multiple hidden layers.

where

$$P_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h} \exp(-E_\theta(\tilde x^{[t]}, \tilde h \mid x^{[t-D,t-1]}))}, \tag{15}$$

and the summation with respect to $\tilde h$ is over all of the possible binary hidden patterns, and the summation with respect to $\tilde x^{[t]}$ is defined analogously. One can then learn the parameters $\theta = (b^{V}, b^{H}, W^{HV}, W^{[1]}, \ldots, W^{[D]})$ of the model by following a gradient-based method in Section 4 from [28].

3.2 Extensions of conditional restricted Boltzmann machines

The CRBM has been extended in various ways. Taylor et al. study a CRBM with multiple layers of hidden units [40] (see Figure 1b). Memisevic and Hinton study a CRBM extended with three-way interactions (i.e., a higher order Boltzmann machine), which they refer to as a gated CRBM [24]. Specifically, the energy of the gated CRBM involves

$$-\sum_{i,j,k} w_{i,j,k}\, x_i\, y_j\, h_k, \tag{16}$$

where $x$ denotes input values, $y$ denotes output values, and $h$ denotes hidden values.

A drawback of the gated CRBM is its increased number of parameters due to the three-way interactions. Taylor and Hinton study a factored CRBM, where the three-way interaction is represented with a reduced number of parameters as follows [39]:

$$-\sum_{f} \sum_{i,j,k} w^{v}_{i,f}\, w^{y}_{j,f}\, w^{h}_{k,f}\, x_i\, y_j\, h_k, \tag{17}$$

where the summation with respect to $f$ is over a set of factors under consideration.
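The CRBM definition in (13) to (15) can be sketched by enumerating a very small model. The function names, the argument layout (for example, `history[d - 1]` holding $x^{[t-d]}$), and the shapes are assumptions made for this illustration only.

```python
import itertools

import numpy as np


def crbm_energy(x_t, h, history, b_v, b_h, w_hv, w_lag):
    """Conditional energy (13); history[d - 1] is x^[t-d], w_lag[d - 1] is W^[d]."""
    energy = -b_v @ x_t - b_h @ h - h @ w_hv @ x_t
    for x_past, w_d in zip(history, w_lag):
        energy = energy - x_past @ w_d @ x_t
    return energy


def crbm_conditional(history, b_v, b_h, w_hv, w_lag):
    """Return all output patterns and P(x^[t] | history), following (14) and (15)."""
    patterns_v = [np.array(p) for p in itertools.product([0, 1], repeat=len(b_v))]
    patterns_h = [np.array(p) for p in itertools.product([0, 1], repeat=len(b_h))]
    # Sum exp(-E) over hidden patterns for each candidate output pattern.
    weights = np.array([
        sum(np.exp(-crbm_energy(x, h, history, b_v, b_h, w_hv, w_lag))
            for h in patterns_h)
        for x in patterns_v
    ])
    return patterns_v, weights / weights.sum()


# Example with D = 2, three visible units, and two hidden units (arbitrary values).
rng = np.random.default_rng(0)
n_v, n_h, D = 3, 2, 2
params = dict(
    b_v=rng.normal(size=n_v),
    b_h=rng.normal(size=n_h),
    w_hv=rng.normal(size=(n_h, n_v)),
    w_lag=[rng.normal(size=(n_v, n_v)) for _ in range(D)],
)
history = [np.array([1, 0, 1]), np.array([0, 0, 1])]  # x^[t-1], x^[t-2]
patterns, probs = crbm_conditional(history, **params)
```

Because the parameters do not depend on $t$, the same `params` would be reused at every time step with a sliding `history`.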

Figure 2: A spiking Boltzmann machine studied in [12]. (a) Structure of a spiking Boltzmann machine. (b) $r$ used in [12]. In (b), we use the figure in the version available at fritz/absps/nips00-ab.pdf.

4 Boltzmann machines for time-series with recurrent structures

4.1 Spiking Boltzmann machines

A spiking Boltzmann machine studied in [12] can be shown to be essentially equivalent to the Boltzmann machine illustrated in Figure 2. This Boltzmann machine consists of input units, output units, and hidden units. The input units represent historical values of visible units and hidden units. Although hidden units are random and cannot be simply given as input, Hinton and Brown make the approximation of using sampled values $H^{(-\infty,t-1]}(\omega)$ as the input hidden units [12].

Specifically, given the visible values $x^{[<t]} \equiv x^{(-\infty,t-1]}$ and sampled hidden values $H^{[<t]}(\omega)$ up to time $t-1$, the energy with the visible values $x^{[t]}$ and hidden values $h^{[t]}$ at time $t$ can be represented as follows:

$$\begin{aligned}
E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = {} & -(b^{V})^\top x^{[t]} - (b^{H})^\top h^{[t]} - (h^{[t]})^\top r(0)\, W^{HV} x^{[t]} \\
& - \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{HH} h^{[t]} - \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{VH} h^{[t]} \\
& - \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{HV} x^{[t]} - \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{VV} x^{[t]},
\end{aligned} \tag{18}$$

where $r$ is an arbitrarily chosen function and is not the target of learning. Namely, the Boltzmann machine has an infinite number of units but can be characterized by a finite number of parameters $\theta \equiv (b^{V}, b^{H}, W^{VV}, W^{VH}, W^{HV}, W^{HH})$. Figure 2b shows the specific $r$ used in [12] (footnote 2).

Notice that the Boltzmann machine in Figure 2 can be seen as a restricted Boltzmann machine (RBM) whose bias and weight can depend on $x^{[<t]}$ and $H^{[<t]}(\omega)$, because (18) can be represented as

$$E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = -b^{H}(t, \omega)^\top h^{[t]} - b^{V}(t, \omega)^\top x^{[t]} - (h^{[t]})^\top W x^{[t]}, \tag{19}$$

Footnote 2: Although it is not clear from the descriptions in [12], the labels in the horizontal axis should probably be shifted by one, so that $r(0) = 0$, $r(1) \approx 0.1$, and so on.

where $b^{H}(t, \omega)$ is the time-varying bias for hidden units, $b^{V}(t, \omega)$ is the time-varying bias for visible units, and $W$ is the weight between visible units and hidden units:

$$b^{H}(t, \omega)^\top \equiv (b^{H})^\top + \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{HH} + \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{VH} \tag{20}$$

$$b^{V}(t, \omega)^\top \equiv (b^{V})^\top + \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{HV} + \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{VV} \tag{21}$$

$$W \equiv r(0)\, W^{HV}. \tag{22}$$

We can then represent the conditional probability as follows:

$$P_\theta(x^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = \sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)), \tag{23}$$

where

$$P_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde x^{[t]}, \tilde h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)))}. \tag{24}$$

We now discuss the choice of $r(0) = 0$, which appears to be the case in Figure 2b. In this case, the energy is reduced to

$$E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = -b^{H}(t, \omega)^\top h^{[t]} - b^{V}(t, \omega)^\top x^{[t]}. \tag{25}$$

Because there are no connections between visible units and hidden units at time $t$, the hidden values at $t$ do not affect the distribution of the visible values at $t$. The only role of the hidden units is that the sampled hidden values are used to update the time-varying biases, $b^{V}(s, \omega)$ and $b^{H}(s, \omega)$ for $s > t$. A problem is that there is no mechanism that allows us to learn appropriate values of $W^{VH}$ and $W^{HH}$ until we observe succeeding visible values. Namely, the hidden values $h^{[t]}$ are sampled with the dependency on $W^{VH}$ and $W^{HH}$, but whether the sampled hidden values are good or not can only be known when those hidden values are used as input. This helps us to learn appropriate values of $W^{HV}$, but not $W^{VH}$ or $W^{HH}$. See [30] for further discussion.

4.2 Temporal restricted Boltzmann machines

Sutskever and Hinton study a model related to a CRBM, which they refer to as a temporal restricted Boltzmann machine (TRBM) [37]. While a CRBM defines the conditional distribution of the visible and hidden values at time $t$ given only the visible values from time $t-D$ to $t-1$, a TRBM defines the corresponding conditional probability given both the visible values and the hidden values from time $t-D$ to $t-1$. See Figure 3a. Similar to the CRBM, the parameters $\theta$ of the TRBM do not depend on time $t$.

Unlike the CRBM, the TRBM is not a conditional RBM.
This is because the TRBM with a single parameter is used for every $t$, and the distribution of hidden values is shared among those TRBMs at varying $t$. In particular, hidden values of the TRBM can depend on the future visible values. Because this dependency makes learning and inference hard, it is ignored in [37]. Namely, the values at each time $t$ are conditionally independent of the values after time $t$ given the values at and before time $t$. In particular, the distribution of the hidden values at time $t$ is completely determined by the visible values up to time $t$. The distribution of the hidden values before time $t$ can thus be considered as input when we use the TRBM to define the conditional distribution of the values at time $t$ (see Figure 3b). For each set of sampled values of hidden units, the TRBM in Figure 3b is a CRBM.

Figure 3: Temporal restricted Boltzmann machines. (a) TRBM. (b) TRBM with approximations in [37]. In (b), the gray circles indicate that expected values are used for the input hidden units.

Furthermore, in [37], the expected values (see Section 5.4 from [28]) are used for the hidden values. Then the input hidden units in Figure 3b take real values in $[0, 1]$ that are completely determined by the visible values before time $t$.

More formally, with the approximations in [37], the TRBM with parameter $\theta$ defines the probability distribution over the time-series of visible and hidden values as follows:

$$P_\theta(x) = \prod_{t=0}^{T} \sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}), \tag{26}$$

where

$$P_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}))} \tag{27}$$

is the conditional distribution defined by the Boltzmann machine shown in Figure 3b, where $r^{[t-D,t-1]}$ are expected hidden values. Specifically, (27) is used to compute

$$r^{[t]} = E_\theta[H^{[t]} \mid x^{[0,t]}], \tag{28}$$

which is subsequently used with (27) for $t \leftarrow t + 1$, where the expectation in (28) is with respect to

$$P_\theta(h^{[t]} \mid x^{[0,t]}) = \frac{P_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]})}{\sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]})}. \tag{29}$$

Notice that $r^{[t]}$ can be computed from $x^{[0,t]}$ in a deterministic manner with dependency on $\theta$. However, this dependency on $\theta$ is ignored in learning TRBMs.

4.3 Recurrent temporal restricted Boltzmann machines

To overcome the intractability of the TRBM without approximations, Sutskever et al. study a refined model of TRBM, which they refer to as a recurrent temporal restricted Boltzmann machine (RTRBM)

[38]. The RTRBM simplifies the TRBM by removing connections between visible layers and connections between hidden layers that are separated by more than one lag. This means that the visible and hidden values at time $t$ are conditionally independent of the visible values before time $t$ and the hidden values before time $t-1$ given the hidden values at time $t-1$. Similar to the approximation made for the TRBM in Figure 3b, the RTRBM uses the expected values for the hidden values at time $t-1$ but defines the conditional distribution of the visible and hidden values at time $t$ over their binary values. See Figure 4.

Figure 4: A recurrent temporal restricted Boltzmann machine.

More formally, let $r^{[t-1]}$ denote the expected values of the hidden units at time $t-1$:

$$r^{[t-1]} \equiv E_\theta[H^{[t-1]} \mid x^{[0,t-1]}], \tag{30}$$

where $H^{[t-1]}$ is the random vector representing the hidden values at time $t-1$, and $E_\theta[\,\cdot \mid x^{[0,t-1]}]$ represents the conditional expectation given the visible values up to time $t-1$. The probability distribution of the values at time $t$ is then given by

$$P_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}))}{\sum_{\tilde x^{[t]}, \tilde h^{[t]}} \exp(-E_\theta(\tilde x^{[t]}, \tilde h^{[t]} \mid r^{[t-1]}))}, \tag{31}$$

where the conditional energy is given by

$$E_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}) \equiv -(b^{V})^\top x^{[t]} - (b^{H})^\top h^{[t]} - (r^{[t-1]})^\top U h^{[t]} - (x^{[t]})^\top W h^{[t]} \tag{32}$$

for parameters

$$\theta \equiv (b^{V}, b^{H}, W, U), \tag{33}$$

where $b^{V}$ is the bias for visible units, $b^{H}$ is the bias for hidden units, $W$ is the weight matrix between visible units and hidden units, and $U$ is the weight matrix between the previous expected values of hidden units at $t-1$ and the hidden units at $t$.

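4.3.1 Inference

The marginal conditional distribution of visible values at time $t \geq 1$ can then be represented as follows:

$$P_\theta(x^{[t]} \mid r^{[t-1]}) = \frac{\exp(-F_\theta(x^{[t]} \mid r^{[t-1]}))}{\sum_{\tilde x^{[t]}} \exp(-F_\theta(\tilde x^{[t]} \mid r^{[t-1]}))}, \tag{34}$$

where the conditional free-energy is given by

$$F_\theta(x^{[t]} \mid r^{[t-1]}) \equiv -\log \sum_{\tilde h^{[t]}} \exp(-E_\theta(x^{[t]}, \tilde h^{[t]} \mid r^{[t-1]})). \tag{35}$$

Once the visible values at time $t$ are given, the conditional probability distribution of the hidden values at time $t$ can be represented as follows:

$$P_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) = \frac{\exp(-E_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}))}{\sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde h^{[t]} \mid r^{[t-1]}, x^{[t]}))}, \tag{36}$$

where the $(b^{V})^\top x^{[t]}$ term is canceled out between the numerator and the denominator, and the conditional energy is given by

$$E_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) \equiv -(b^{H} + U^\top r^{[t-1]} + W^\top x^{[t]})^\top h^{[t]}. \tag{37}$$

By Corollary 1 from [28], the hidden values at time $t$ are conditionally independent of each other given $r^{[t-1]}$ and $x^{[t]}$:

$$P_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) = \prod_{j} P_\theta(h_j^{[t]} \mid r^{[t-1]}, x^{[t]}), \tag{38}$$

where

$$P_\theta(h_j^{[t]} \mid r^{[t-1]}, x^{[t]}) = \frac{\exp(b_j^{[t]} h_j^{[t]})}{1 + \exp(b_j^{[t]})}, \tag{39}$$

where $b_j^{[t]}$ is the $j$-th element of

$$b^{[t]} \equiv b^{H} + U^\top r^{[t-1]} + W^\top x^{[t]} \tag{40}$$

for $t \geq 1$, and

$$b^{[0]} \equiv b^{\mathrm{init}} + W^\top x^{[0]}, \tag{41}$$

where we now follow [38] and allow the hidden units at time 0 to have their own bias $b^{\mathrm{init}}$ that can differ from $b^{H}$. The expected values are thus given by

$$r^{[t]} = \frac{1}{1 + \exp(-b^{[t]})}, \tag{42}$$

where the operations are defined elementwise.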
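The deterministic recursion in (40) to (42) can be sketched in a few lines of NumPy. The function name and the array layout are assumptions for illustration: `xs` has shape `(T + 1, N_V)`, `W` has shape `(N_V, N_H)`, and `U` has shape `(N_H, N_H)`, so that the energy terms in (32) read `x @ W @ h` and `r @ U @ h`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def expected_hidden(xs, b_h, b_init, W, U):
    """Return r with r[t] = E[H^[t] | x^[0,t]] for t = 0, ..., T."""
    r = np.empty((len(xs), len(b_h)))
    r[0] = sigmoid(b_init + W.T @ xs[0])                     # (41) and (42)
    for t in range(1, len(xs)):
        r[t] = sigmoid(b_h + U.T @ r[t - 1] + W.T @ xs[t])   # (40) and (42)
    return r


# Example with arbitrary small parameters.
rng = np.random.default_rng(0)
xs = rng.integers(0, 2, size=(5, 3))
r = expected_hidden(
    xs,
    b_h=rng.normal(size=2),
    b_init=rng.normal(size=2),
    W=rng.normal(size=(3, 2)),
    U=rng.normal(size=(2, 2)),
)
```

Each `r[t]` depends on all of `xs[0..t]` through `r[t - 1]`, which is the recurrence that later makes the gradient require back propagation through time.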

Figure 5: A recurrent temporal restricted Boltzmann machine unfolded through time.

4.3.2 Learning

The parameters of an RTRBM can be trained through back propagation through time, analogous to recurrent neural networks, but with contrastive divergence. To understand how we can train RTRBMs, Figure 5 shows an RTRBM unfolded through time. Recall that the expected values of hidden units are deterministically updated from $r^{[t-1]}$ to $r^{[t]}$ according to (40)-(42). Hence, $r^{[t]}$ can be understood as hidden values of a recurrent neural network (RNN) [34]. An RTRBM can then be seen as an RNN but gives an RBM as an output instead of real values, which would be given as an output from the standard RNN.

We will derive the learning rule of the RTRBM, closely following [25] but using our notations. When an RTRBM is unfolded through time, its energy can be represented as follows:

$$E_\theta(x, h) = -\sum_{t=0}^{T} (b^{V})^\top x^{[t]} - (b^{\mathrm{init}})^\top h^{[0]} - \sum_{t=1}^{T} (b^{H})^\top h^{[t]} - \sum_{t=0}^{T} (x^{[t]})^\top W h^{[t]} - \sum_{t=1}^{T} (r^{[t-1]})^\top U h^{[t]}. \tag{43}$$

By (5), we can maximize the log-likelihood of a given time-series $x$ with a gradient-based approach. What we need in (5) is the gradient of the energy with respect to the parameters. A caveat is that the energy in (43) depends on $r^{[\cdot]}$, which in turn depends on $\theta$ in a recursive manner. Also, the expectation with respect to $P_\theta$ in (5) needs to be computed with an approximation such as contrastive divergence (see Section 5.2).

We first study the last term of (43). Let

$$Q_s \equiv \sum_{t=s}^{T} (r^{[t-1]})^\top U h^{[t]} \tag{44}$$

$$\phantom{Q_s} = (r^{[s-1]})^\top U h^{[s]} + Q_{s+1}, \tag{45}$$

for $s \in [1, T]$, where $Q_{T+1} \equiv 0$, so that $Q \equiv Q_1$ is, up to its sign, the last term of (43). Taking the partial derivative

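with respect to $r_i^{[s-1]}$, we obtain

$$\frac{\partial Q_s}{\partial r_i^{[s-1]}} = \frac{\partial}{\partial r_i^{[s-1]}} (r^{[s-1]})^\top U h^{[s]} + \sum_{j} \frac{\partial r_j^{[s]}}{\partial r_i^{[s-1]}} \frac{\partial Q_{s+1}}{\partial r_j^{[s]}} \tag{46}$$

$$\phantom{\frac{\partial Q_s}{\partial r_i^{[s-1]}}} = U_{i,:}\, h^{[s]} + \sum_{j} r_j^{[s]} (1 - r_j^{[s]})\, u_{i,j}\, \frac{\partial Q_{s+1}}{\partial r_j^{[s]}}, \tag{47}$$

where $U_{i,:}$ denotes the $i$-th row of $U$, $u_{i,j}$ denotes the $(i,j)$-th element of $U$, and the last equality follows from (40)-(42). In vector-matrix notations, we can write

$$\nabla_{r^{[s-1]}} Q_s = U \big( h^{[s]} + r^{[s]} \odot (1 - r^{[s]}) \odot \nabla_{r^{[s]}} Q_{s+1} \big), \tag{48}$$

where $\odot$ denotes elementwise multiplication. Because $Q_s$ is not a function of $r^{[0]}, \ldots, r^{[s-2]}$, we have

$$\nabla_{r^{[s-1]}} Q = \nabla_{r^{[s-1]}} Q_s. \tag{49}$$

Therefore, the partial derivative of $Q$ with respect to $r^{[s-1]}$ is given recursively as follows:

$$\nabla_{r^{[s-1]}} Q = U \big( h^{[s]} + r^{[s]} \odot (1 - r^{[s]}) \odot \nabla_{r^{[s]}} Q \big) \tag{50}$$

for $s = 1, \ldots, T$ and

$$\nabla_{r^{[T]}} Q = 0. \tag{51}$$

We now take the derivative of $Q$ with respect to the parameters in $\theta$, starting with $U$:

$$\frac{dQ}{du_{i,j}} = \sum_{t=0}^{T} \sum_{k} \frac{\partial Q}{\partial r_k^{[t]}} \frac{\partial r_k^{[t]}}{\partial u_{i,j}} + \frac{\partial Q}{\partial u_{i,j}} \tag{52}$$

$$\phantom{\frac{dQ}{du_{i,j}}} = \sum_{t=1}^{T} \frac{\partial Q}{\partial r_j^{[t]}}\, r_j^{[t]} (1 - r_j^{[t]})\, r_i^{[t-1]} + \sum_{t=1}^{T} r_i^{[t-1]} h_j^{[t]}, \tag{53}$$

where the last equality follows from (40)-(42). In vector-matrix notations, we can write

$$\nabla_U Q = \sum_{t=1}^{T} r^{[t-1]} \big( r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q + h^{[t]} \big)^\top, \tag{54}$$

where $\nabla_{r^{[t]}} Q$ is given by (50)-(51). The gradient of $Q$ with respect to other parameters can be derived as follows:

$$\nabla_W Q = \sum_{t=0}^{T} x^{[t]} \big( r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \big)^\top \tag{55}$$

$$\nabla_{b^{H}} Q = \sum_{t=1}^{T} r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \tag{56}$$

$$\nabla_{b^{\mathrm{init}}} Q = r^{[0]} \odot (1 - r^{[0]}) \odot \nabla_{r^{[0]}} Q \tag{57}$$

$$\nabla_{b^{V}} Q = 0. \tag{58}$$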
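The backward recursion (50)-(51) and the gradient (54) can be checked numerically. The sketch below is an illustration under assumed names and shapes (`xs`: `(T + 1, N_V)`, `hs`: `(T + 1, N_H)`, `W`: `(N_V, N_H)`, `U`: `(N_H, N_H)`); it compares the analytic gradient of $Q = \sum_{t=1}^{T} (r^{[t-1]})^\top U h^{[t]}$ with a central finite difference, and is not taken from [25] or [38].

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(xs, b_h, b_init, W, U):
    """Expected hidden values r[0..T] from (40)-(42)."""
    r = np.empty((len(xs), len(b_h)))
    r[0] = sigmoid(b_init + W.T @ xs[0])
    for t in range(1, len(xs)):
        r[t] = sigmoid(b_h + U.T @ r[t - 1] + W.T @ xs[t])
    return r


def q_value(xs, hs, b_h, b_init, W, U):
    """Q = Q_1 from (44), with r recomputed from the parameters."""
    r = forward(xs, b_h, b_init, W, U)
    return sum(r[t - 1] @ U @ hs[t] for t in range(1, len(xs)))


def grad_u(xs, hs, b_h, b_init, W, U):
    """Gradient of Q with respect to U via (50), (51), and (54)."""
    r = forward(xs, b_h, b_init, W, U)
    T = len(xs) - 1
    grad_r = np.zeros_like(r)                    # grad_r[T] = 0 by (51)
    for s in range(T, 0, -1):                    # backward recursion (50)
        grad_r[s - 1] = U @ (hs[s] + r[s] * (1 - r[s]) * grad_r[s])
    grad = np.zeros_like(U)
    for t in range(1, T + 1):                    # accumulate (54)
        grad += np.outer(r[t - 1], r[t] * (1 - r[t]) * grad_r[t] + hs[t])
    return grad


def numerical_grad_u(xs, hs, b_h, b_init, W, U, eps=1e-6):
    """Central finite-difference approximation of dQ/dU."""
    grad = np.zeros_like(U)
    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            step = np.zeros_like(U)
            step[i, j] = eps
            grad[i, j] = (q_value(xs, hs, b_h, b_init, W, U + step)
                          - q_value(xs, hs, b_h, b_init, W, U - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
xs = rng.integers(0, 2, size=(5, 3)).astype(float)
hs = rng.integers(0, 2, size=(5, 2)).astype(float)
args = (xs, hs, rng.normal(size=2), rng.normal(size=2),
        rng.normal(size=(3, 2)), rng.normal(size=(2, 2)))
analytic = grad_u(*args)
numeric = numerical_grad_u(*args)
```

The backward loop touches every earlier time step, so one gradient evaluation costs time proportional to $T$; this is the per-step cost that grows with the length of the preceding time-series in online learning.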

The gradients of $Q$ can be used to show the following gradients of the energy:

$$\nabla_U E_\theta(x, h) = -\sum_{t=1}^{T} r^{[t-1]} \big( r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q + h^{[t]} \big)^\top \tag{59}$$

$$\nabla_W E_\theta(x, h) = -\sum_{t=0}^{T} x^{[t]} (h^{[t]})^\top - \sum_{t=0}^{T} x^{[t]} \big( r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \big)^\top \tag{60}$$

$$\nabla_{b^{H}} E_\theta(x, h) = -\sum_{t=1}^{T} h^{[t]} - \sum_{t=1}^{T} r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \tag{61}$$

$$\nabla_{b^{\mathrm{init}}} E_\theta(x, h) = -h^{[0]} - r^{[0]} \odot (1 - r^{[0]}) \odot \nabla_{r^{[0]}} Q \tag{62}$$

$$\nabla_{b^{V}} E_\theta(x, h) = -\sum_{t=0}^{T} x^{[t]}, \tag{63}$$

where $\nabla_{r^{[t]}} Q$ is given by (50)-(51). The gradient of the log-likelihood of the given time-series $x$ now follows from (72) in [28].

4.3.3 Extensions

The RTRBM has been extended in various ways. Mittelman et al. study a structured RTRBM, where units are partitioned into blocks, and only the connections between particular blocks are allowed [25]. Lyu et al. replace the RNN of the RTRBM with one with Long Short-Term Memory (LSTM) [22]. Schrauwen and Buesing replace the RNN of the RTRBM with an echo state network [36]. An RNN-RBM slightly generalizes the RTRBM by relaxing the constraint of the RTRBM that $r^{[t]}$ must be the expected value of $h^{[t]}$ [7]. Namely, an RNN-RBM is an RNN but gives an RBM as an output, where the RNN and RBM do not share parameters, while an RTRBM shares parameters between an RNN and an RBM.

5 Dynamic Boltzmann machines

BPTT is not desirable for online learning, where we update $\theta$ every time a new pattern $x^{[t]}$ is observed. The per-step computational complexity of BPTT in online learning grows linearly with the length of the preceding time-series. Such online learning, however, is needed for example when we cannot store all observed patterns in memory or when we want to adapt to a changing environment.

The dynamic Boltzmann machine (DyBM) is proposed as a time-series model that allows efficient online learning [31, 32]. The per-step computational complexity of the learning rule of a DyBM is independent of the length of the preceding time-series. In Section 5.1, we start by reviewing the DyBM introduced in [31, 32] and the relation of its learning rule to spike-timing dependent plasticity (STDP).
In Section 5.2, we study the relaxation of some of the constraints that the DyBM has required in [31, 32] in a way that it becomes more suitable for inference and learning [27]. The primary purpose of these constraints in [31, 32] was to mimic a particular form of STDP. The relaxed DyBM generalizes the original DyBM and allows us to interpret it as a form of logistic regression for time-series data.

In Section 5.3, we review DyBMs dealing with real-valued time-series [27, 8]. These DyBMs are analogous to how Gaussian Boltzmann machines [23, 43, 13] deal with real-valued patterns as opposed to Boltzmann machines [2, 14] for binary values. The Gaussian DyBM can be related to a vector autoregressive (VAR) model [21]. Specifically, we show that a special case of the Gaussian DyBM is a VAR model having additional variables that capture long term dependency of time-series. These additional variables correspond to the DyBM's eligibility traces, which represent how recently and frequently spikes arrived from one neuron to another. We also review an extension of the Gaussian DyBM to deal with time-series patterns in continuous space [17].

Figure 6: A dynamic Boltzmann machine unfolded through time (Figure 1c from [31]).

Some of the models and learning algorithms in this section have been implemented in Python or Java and open-sourced.

5.1 Dynamic Boltzmann machines for binary-valued time-series

5.1.1 Finite dynamic Boltzmann machines (footnote 3)

The DyBM in [31, 32] is defined as a limit of a sequence of Boltzmann machines (DyBM-$T$) consisting of $T$ layers as $T$ tends to infinity (see Figure 6). Formally, the DyBM-$T$ is defined as the CRBM (see Section 3.1) having $T$ layers of $N$ visible units ($T-1$ layers of input units and one layer of output units) and no hidden units, so that its conditional energy is defined as

$$E_\theta(x^{[t]} \mid x^{[t-T+1,t-1]}) = -b^\top x^{[t]} - \sum_{\delta=1}^{T-1} (x^{[t-\delta]})^\top W^{[\delta]} x^{[t]}, \tag{64}$$

where the weight of the DyBM-$T$, $(W^{[1]}, \ldots, W^{[T-1]})$, assumes a particular parametric form with a finite number of parameters that are independent of $T$, which we discuss in the following.

The parametric form of the weight in the DyBM-$T$ is motivated by observations from biological neural networks [1] but leads to a particularly simple, exact, and efficient learning rule. In biological neural networks, STDP has been postulated and supported experimentally. In particular, the weight from a pre-synaptic neuron to a post-synaptic neuron is strengthened if the post-synaptic neuron fires (generates a spike) shortly after the pre-synaptic neuron fires (i.e., long term potentiation or LTP). This weight is weakened if the post-synaptic neuron fires shortly before the pre-synaptic neuron fires (i.e., long term depression or LTD). This dependency on the timing of spikes is missing in the Hebbian rule for the Boltzmann machine (see (36) from [28]).

To have a learning rule with the characteristics of STDP with LTP and LTD, the DyBM-$T$ assumes the weight of the form illustrated in Figure 7. For $\delta > 0$, we define the weight, $w_{i,j}^{[\delta]}$, as the sum of two weights, $\hat w_{i,j}^{[\delta]}$ and $\hat w_{j,i}^{[-\delta]}$:

$$w_{i,j}^{[\delta]} = \hat w_{i,j}^{[\delta]} + \hat w_{j,i}^{[-\delta]}, \tag{65}$$

Footnote 3: This section closely follows [31].

Figure 7: The figure illustrates Equation (65) with particular forms of Equation (66) (Figure 2 from [31]). The horizontal axis represents $\delta$, and the vertical axis represents the value of $w_{i,j}^{[\delta]}$ (solid curves), $\hat w_{i,j}^{[\delta]}$ (dashed curves), or $\hat w_{j,i}^{[-\delta]}$ (dotted curves). Notice that $w_{i,j}^{[\delta]}$ is defined for $\delta > 0$ and is discontinuous at $\delta = d$. On the other hand, $\hat w_{i,j}^{[\delta]}$ and $\hat w_{j,i}^{[-\delta]}$ are defined for $-\infty < \delta < \infty$ and are discontinuous at $\delta = d_{i,j}$ and $\delta = -d_{j,i}$, respectively, where recall that we assume $d_{i,j} = d_{j,i} = d$ in this paper.

where

$$\hat w_{i,j}^{[\delta]} = \begin{cases} 0 & \text{if } \delta = 0 \\ u_{i,j}\, \lambda^{\delta - d} & \text{if } \delta \geq d \\ -v_{i,j}\, \mu^{-\delta} & \text{otherwise} \end{cases} \tag{66}$$

for $\lambda, \mu \in [0, 1)$. For simplicity, we assume a single decay rate $\lambda$ for $\delta \geq d$ and a single decay rate $\mu$ for $\delta < d$, as opposed to multiple ones in [31, 32]. For simplicity, we also assume that the conduction delay $d$ is uniform for all connections, as opposed to the variable conduction delay in [32]. See also [9, 29] for ways to tune the values of the conduction delay.

In Figure 7, the value of $\hat w_{i,j}^{[\delta]}$ is high when $\delta = d$, the conduction delay from the $i$-th (pre-synaptic) unit to the $j$-th (post-synaptic) unit. Namely, the post-synaptic neuron is likely to fire (i.e., $x_j^{[0]} = 1$) immediately after the spike from the pre-synaptic unit arrives with the delay of $d$ (i.e., $x_i^{[-d]} = 1$). This likelihood is controlled by the LTP weight $u_{i,j}$. The value of $\hat w_{i,j}^{[\delta]}$ gradually decreases as $\delta$ increases from $d$. That is, the effect of the stimulus of the spike arrived from the $i$-th unit diminishes with time [1].

The value of $\hat w_{i,j}^{[d-1]}$ is low, suggesting that the post-synaptic unit is unlikely to fire (i.e., $x_j^{[0]} = 1$) immediately before the spike from the $i$-th (pre-synaptic) unit arrives. This unlikelihood is controlled by the LTD weight $v_{i,j}$. As $\delta$ decreases from $d-1$, the magnitude of $\hat w_{i,j}^{[\delta]}$ gradually decreases [1]. Here, $\delta$ can get smaller than 0, and $\hat w_{i,j}^{[\delta]}$ with $\delta < 0$ represents the weight between the spike of the pre-synaptic neuron that is generated after the spike of the post-synaptic neuron.
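A minimal sketch of the parametric weight in (65) and (66) for a single pair of units follows. The function names and the scalar interface are assumptions for illustration; the single decay rates `lam` and `mu` and the uniform delay `d` mirror the simplifications stated above.

```python
def w_hat(delta, u, v, lam, mu, d):
    """Equation (66) for one connection with LTP weight u and LTD weight v."""
    if delta == 0:
        return 0.0
    if delta >= d:
        return u * lam ** (delta - d)   # LTP branch: decays as delta grows past d
    return -v * mu ** (-delta)          # LTD branch: covers delta < d, including delta < 0


def w(delta, u_ij, v_ij, u_ji, v_ji, lam, mu, d):
    """Equation (65) for delta > 0: w_hat from i to j at delta plus w_hat from j to i at -delta."""
    return w_hat(delta, u_ij, v_ij, lam, mu, d) + w_hat(-delta, u_ji, v_ji, lam, mu, d)


# Example with d = 3 and both decay rates equal to 0.5.
at_delay = w(3, 1.0, 1.0, 1.0, 1.0, lam=0.5, mu=0.5, d=3)      # LTP peak plus a small LTD term
before_delay = w(1, 1.0, 1.0, 1.0, 1.0, lam=0.5, mu=0.5, d=3)  # LTD terms only
```

For $\delta > 0$ the second term of (65) always falls in the LTD branch, so `w` jumps from negative to positive values exactly at `delta == d`, which is the discontinuity noted in the caption of Figure 7.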

Figure 8: A connection from a pre-synaptic neuron to a post-synaptic neuron in a DyBM (labels: neural eligibility trace $\gamma$, FIFO queue, synaptic eligibility trace $\alpha$).

5.1.2 Dynamic Boltzmann machine as a limit of a sequence of finite dynamic Boltzmann machines (footnote 4)

The DyBM is defined as a limit of the sequence of DyBM-$T$ as $T \to \infty$. Because each DyBM-$T$ is a CRBM, we can also define the limit of the sequence of the conditional probability defined by DyBM-$T$, and this limit is considered as the conditional probability defined by the DyBM. Likewise, the conditional energy of the DyBM is defined as the limit of the sequence of the conditional energy of DyBM-$T$. Specifically, as $T \to \infty$, the conditional energy of DyBM-$T$ in (64) converges to

$$E_\theta(x^{[t]} \mid x^{[<t]}) = -b^\top x^{[t]} - \sum_{\delta=1}^{\infty} (x^{[t-\delta]})^\top W^{[\delta]} x^{[t]}, \tag{67}$$

where the convergence is due to the parametric form (66). This conditional energy in turn defines the conditional distribution via (9), where we now have no hidden units.

Although the conditional energy (67) of the DyBM involves an infinite sum, it can be evaluated with a finite sum because of the parametric form (66). In fact, the DyBM can be understood as an artificial model of a spiking neural network where all computation for inference and learning is performed locally at each synapse using only the information available around the synapse.

Specifically, a pre-synaptic neuron is connected to a post-synaptic neuron via a first-in-first-out (FIFO) queue and a synapse (see Figure 8). At each discrete time $t$, a neuron $i$ either fires ($x_i^{[t]} = 1$) or not ($x_i^{[t]} = 0$). The spike travels along the FIFO queue and reaches the synapse after the conduction delay, $d$. In other words, the FIFO queue has the length of $d-1$ and stores, at time $t$, the spikes that have been generated by the pre-synaptic neuron from time $t-d+1$ to time $t-1$.

Each synapse in a DyBM stores a quantity called a synaptic eligibility trace. The value of the synaptic eligibility trace increases when a spike arrives at the synapse from the FIFO queue; otherwise, it is decreased by a constant factor.
Specifically, at time $t$, the value of the synaptic eligibility trace, $\alpha_{i,j}^{[t]}$, that is stored at the synapse from a pre-synaptic neuron $i$ is updated as follows:

$$\alpha_{i,j}^{[t]} = \lambda\, \alpha_{i,j}^{[t-1]} + x_i^{[t-d+1]}, \tag{68}$$

where $\lambda$ is a decay rate and satisfies $0 \leq \lambda < 1$. Figure 9 shows an example of how the value of the synaptic eligibility trace changes depending on the spikes arrived at the synapse. Observe that $\alpha_{i,j}^{[t]}$ represents how recently and frequently spikes arrived from a pre-synaptic neuron and can be represented non-recursively as follows:

$$\alpha_{i,j}^{[t-1]} = \sum_{s=-\infty}^{t-d} \lambda^{t-s-d}\, x_i^{[s]}. \tag{69}$$
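The equivalence of the recursive update (68) and the non-recursive form (69) can be illustrated for a single synapse. The sketch assumes, for illustration only, that the spike train starts at time 0 with no earlier spikes and that the update takes the form $\alpha^{[t]} = \lambda \alpha^{[t-1]} + x^{[t-d+1]}$, which is the form that agrees term by term with (69); the function names are hypothetical.

```python
import numpy as np


def synaptic_trace_recursive(x, lam, d):
    """Return alpha[t] for t = 0, ..., len(x) - 1 using a constant-time update per step."""
    alpha = np.zeros(len(x))
    previous = 0.0
    for t in range(len(x)):
        # Spike leaving the FIFO queue and arriving at the synapse at this step.
        arriving = x[t - d + 1] if t - d + 1 >= 0 else 0.0
        previous = lam * previous + arriving
        alpha[t] = previous
    return alpha


def synaptic_trace_direct(x, lam, d, t):
    """Return alpha[t] from the non-recursive sum, i.e., (69) shifted from t - 1 to t."""
    return sum(lam ** (t + 1 - s - d) * x[s] for s in range(0, t - d + 2))


spikes = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
recursive = synaptic_trace_recursive(spikes, lam=0.5, d=3)
direct = np.array([synaptic_trace_direct(spikes, 0.5, 3, t) for t in range(len(spikes))])
```

The recursive form needs only the previous trace value and one spike from the queue, so its cost per step does not depend on how long the preceding time-series is.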

Figure 9: The value of a synaptic or neural eligibility trace as a function of time (horizontal axis: time in DyBM steps; vertical axis: eligibility and spike values). For a synaptic eligibility trace at a synapse, the bars represent the spikes arrived from a FIFO queue at that synapse. For a neural eligibility trace at a neuron, the bars represent the spikes generated by that neuron.

Each neuron in a DyBM stores a quantity called a neural eligibility trace (footnote 5). The value of the neural eligibility trace increases when the neuron fires; otherwise, it is decreased by a constant factor. Specifically, at time $t$, the value of the neural eligibility trace, $\gamma_j^{[t]}$, at a neuron $j$ is updated as follows:

$$\gamma_j^{[t]} = \mu\, (\gamma_j^{[t-1]} + x_j^{[t]}), \tag{70}$$

where $\mu$ is a decay rate and satisfies $0 \leq \mu < 1$. Observe that $\gamma_j^{[t]}$ represents how recently and frequently the neuron has fired and can be represented non-recursively as follows:

$$\gamma_j^{[t-1]} = \sum_{s=-\infty}^{t-1} \mu^{t-s}\, x_j^{[s]}. \tag{71}$$

Footnote 4: This section closely follows [27].
Footnote 5: We assume a single neural eligibility trace, as opposed to multiple ones in [32], at each neuron.

A neuron in a DyBM fires according to the probability distribution that depends on the energy of the DyBM. A neuron is more likely to fire when the energy becomes lower if it fires than otherwise. Let $E_{\theta,j}(x_j^{[t]} \mid x^{[<t]})$ be the energy associated with a neuron $j$ at time $t$, which can depend on whether $j$ fires at time $t$ (i.e., $x_j^{[t]}$) as well as the preceding spiking activities of the neurons in the DyBM (i.e., $x^{[<t]}$). The firing probability of a neuron $j$ is then given by

$$P_{\theta,j}(x_j^{[t]} \mid x^{[<t]}) = \frac{\exp(-E_{\theta,j}(x_j^{[t]} \mid x^{[<t]}))}{\sum_{\tilde x \in \{0,1\}} \exp(-E_{\theta,j}(\tilde x \mid x^{[<t]}))} \tag{72}$$

for $x_j^{[t]} \in \{0, 1\}$. Specifically, $E_{\theta,j}(x_j^{[t]} \mid x^{[<t]})$ can be represented as follows:

$$E_{\theta,j}(x_j^{[t]} \mid x^{[<t]}) = -b_j\, x_j^{[t]} + E_{\theta,j}^{\mathrm{LTP}}(x_j^{[t]} \mid x^{[<t]}) + E_{\theta,j}^{\mathrm{LTD}}(x_j^{[t]} \mid x^{[<t]}), \tag{73}$$

where $b_j$ is the bias parameter of a neuron $j$ and represents how likely $j$ spikes ($j$ is more likely to fire

if $b_j$ has a large positive value), and we define

$$E_{\theta,j}^{\mathrm{LTP}}(x_j^{[t]} \mid x^{[<t]}) \equiv -\sum_{i=1}^{N} u_{i,j}\, \alpha_{i,j}^{[t-1]}\, x_j^{[t]} \tag{74}$$

$$E_{\theta,j}^{\mathrm{LTD}}(x_j^{[t]} \mid x^{[<t]}) \equiv \sum_{i=1}^{N} v_{i,j}\, \beta_{i,j}^{[t-1]}\, x_j^{[t]} + \sum_{k=1}^{N} v_{j,k}\, \gamma_k^{[t-1]}\, x_j^{[t]}, \tag{75}$$

where $\beta_{i,j}^{[t-1]}$ represents how soon and frequently spikes will arrive at the synapse from the FIFO queue from $i$ to $j$:

$$\beta_{i,j}^{[t-1]} \equiv \sum_{s=t-d+1}^{t-1} \mu^{s-t}\, x_i^{[s]}. \tag{76}$$

Although $\beta_{i,j}^{[t-1]}$ can also be represented in a recursive manner, recursively computed $\beta_{i,j}^{[t-1]}$ is prone to numerical instability.

In (74), the summation with respect to $i$ is over all of the pre-synaptic neurons that are connected to $j$. Here, $u_{i,j}$ is the weight parameter from $i$ to $j$ and represents the strength of Long Term Potentiation (LTP). This weight parameter is thus referred to as the LTP weight. A neuron $j$ is more likely to fire ($x_j^{[t]} = 1$) when $\alpha_{i,j}^{[t-1]}$ is large for a pre-synaptic neuron $i$ connected to $j$ (spikes have recently arrived at $j$ from $i$) and the corresponding $u_{i,j}$ is positive and large (LTP from $i$ to $j$ is strong).

In (75), the summation with respect to $i$ is over all of the pre-synaptic neurons that are connected to $j$, and the summation with respect to $k$ is over all of the post-synaptic neurons which $j$ is connected to. Here, $v_{i,j}$ represents the strength of Long Term Depression from $i$ to $j$ and is referred to as the LTD weight. The neuron $j$ is less likely to fire when $\beta_{i,j}^{[t-1]}$ is large for a pre-synaptic neuron $i$ connected to $j$ (spikes will soon and frequently reach $j$ from $i$) and the corresponding $v_{i,j}$ is positive and large (LTD from $i$ to $j$ is strong). The second term in (75) represents that a pre-synaptic neuron $j$ is less likely to fire if a post-synaptic neuron $k$ has recently and frequently fired ($\gamma_k$ is large), and the strength of this LTD is given by $v_{j,k}$. Notice that the timing of a spike is measured with respect to when the spike reaches the synapse, where the spike from a pre-synaptic neuron has the delay $d$, and the spike from a post-synaptic neuron reaches it immediately.

The learning rule of the DyBM has been derived in a way that it maximizes the log-likelihood of a given time-series with respect to the probability distribution given by (72) [32].
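Because $E_{\theta,j}(0 \mid x^{[<t]}) = 0$ in (73) to (75), the firing probability (72) reduces to a logistic function of the energy at $x_j^{[t]} = 1$. The sketch below evaluates it for all neurons at once; the function name and the array conventions (entry `[i, j]` refers to the connection from `i` to `j`) are assumptions for illustration.

```python
import numpy as np


def firing_probability(b, U, V, alpha, beta, gamma):
    """Return P(x_j^[t] = 1 | x^[<t]) for every neuron j, following (72)-(75).

    b: (N,) biases; U, V: (N, N) LTP and LTD weights;
    alpha, beta: (N, N) traces alpha_{i,j}^[t-1] and beta_{i,j}^[t-1];
    gamma: (N,) neural eligibility traces gamma_k^[t-1].
    """
    ltp = -(U * alpha).sum(axis=0)            # coefficient of x_j in (74)
    ltd = (V * beta).sum(axis=0) + V @ gamma  # coefficient of x_j in (75)
    energy_if_fire = -b + ltp + ltd           # E_{theta,j}(1 | x^[<t]) from (73)
    return 1.0 / (1.0 + np.exp(energy_if_fire))


# Example with three neurons and arbitrary traces.
rng = np.random.default_rng(1)
N = 3
b, gamma = rng.normal(size=N), rng.random(N)
U, V = rng.normal(size=(N, N)), rng.normal(size=(N, N))
alpha, beta = rng.random((N, N)), rng.random((N, N))
p_fire = firing_probability(b, U, V, alpha, beta, gamma)
```

All inputs are quantities held at the neurons and synapses at time $t-1$, so evaluating the probability requires no access to the full history.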
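Specifically, at time $t$, the DyBM updates its parameters according to

$$b_j \leftarrow b_j + \eta \left( x_j^{[t]} - \mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}] \right) \tag{77}$$

$$u_{i,j} \leftarrow u_{i,j} + \eta \, \alpha_{i,j}^{[t-1]} \left( x_j^{[t]} - \mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}] \right) \tag{78}$$

$$v_{i,j} \leftarrow v_{i,j} + \eta \, \beta_{i,j}^{[t-1]} \left( \mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}] - x_j^{[t]} \right) + \eta \, \gamma_j^{[t-1]} \left( \mathbb{E}_{\theta,i}[X_i^{[t]} \mid \mathbf{x}^{[<t]}] - x_i^{[t]} \right) \tag{79}$$

for each of neurons $i$ and $j$, where $\eta$ is a learning rate, $x_j^{[t]}$ is the training data given to $j$ at time $t$, and $\mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}]$ denotes the expected value of $x_j^{[t]}$ (i.e., the firing probability of a neuron $j$ at time $t$) according to the probability distribution given by (72). By following stochastic gradient methods [6, 18, 10, 41, 33], the learning rate $\eta$ may be adjusted over time $t$.

Relation to spike-timing dependent plasticity^6

In spike-timing dependent plasticity (STDP), the amount of the change in the weight between two neurons that fired together depends on the precise timings when the two neurons fired. STDP supplements the Hebbian rule [11] and has been experimentally confirmed in biological neural networks [5].

^6 This section closely follows [27].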
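To make the update rules (77) and (78) concrete, the following is a minimal Python sketch for a single synapse from $i$ to $j$. It assumes, purely for illustration, that the LTD terms of (73) are dropped, so that the energy of firing is $-(b_j + u_{i,j}\,\alpha_{i,j}^{[t-1]})$ and the energy of not firing is zero. The function names and numerical values are hypothetical and are not taken from [32].

```python
import math


def firing_probability(b, u, alpha):
    """P(x_j = 1) from (72) when the energy keeps only the bias and LTP terms."""
    # Energy of firing is -(b + u * alpha); energy of not firing is 0,
    # so the ratio in (72) reduces to a logistic function.
    return 1.0 / (1.0 + math.exp(-(b + u * alpha)))


def update_bias_and_ltp(b, u, alpha, x, eta):
    """One step of (77) and (78) for a single synapse (LTD terms omitted)."""
    p = firing_probability(b, u, alpha)  # expected value of x_j
    new_b = b + eta * (x - p)            # rule (77)
    new_u = u + eta * alpha * (x - p)    # rule (78)
    return new_b, new_u


# The post-synaptic neuron fires (x = 1) after few or many recent pre-synaptic spikes.
_, u_small = update_bias_and_ltp(b=0.0, u=0.0, alpha=1.0, x=1, eta=0.1)
_, u_large = update_bias_and_ltp(b=0.0, u=0.0, alpha=2.0, x=1, eta=0.1)
```

Starting from $u_{i,j} = 0$, the firing probability does not depend on the trace, so doubling $\alpha_{i,j}^{[t-1]}$ doubles the increase of the LTP weight. This is the dependence on recent pre-synaptic spikes that the discussion of STDP relies on.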

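In (78), $u_{i,j}$ is increased (LTP gets stronger) when $x_j^{[t]} = 1$ is given to $j$. Then $j$ becomes more likely to fire when spikes from $i$ have recently and frequently arrived at $j$ (i.e., $\alpha_{i,j}^{[\cdot]}$ is large). This amount of the change in $u_{i,j}$ depends on $\alpha_{i,j}^{[t-1]}$, exhibiting a key property of STDP. In particular, $u_{i,j}$ is increased by a large amount if spikes from $i$ have recently and frequently arrived at $j$.

According to the second term on the right-hand side of (79), $v_{i,j}$ is increased (LTD gets stronger) when $x_j^{[t]} = 0$ is given to a post-synaptic neuron $j$. Then $j$ becomes less likely to fire when spikes from $i$ are expected to reach $j$ soon (i.e., $\beta_{i,j}^{[\cdot]}$ is large). This amount of the change in $v_{i,j}$ is large if there are spikes in the FIFO queue from $i$ to $j$ and they are close to $j$. According to the last term of (79), $v_{i,j}$ is increased when $x_i^{[t]} = 0$ is given to the pre-synaptic $i$, and this amount of the change in $v_{i,j}$ is proportional to $\gamma_j$ (i.e., how frequently and recently the post-synaptic $j$ has fired). This learning rule of (79) thus exhibits some of the key properties of LTD with STDP.

In (77), $b_j$ is increased when $x_j^{[t]} = 1$ is given to $j$, so that $j$ becomes more likely to fire (in accordance with the training data), but the amount of the change in $b_j$ is small if $j$ is already likely to fire ($\mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}] \approx 1$). This dependency on $\mathbb{E}_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}]$ can be considered as a form of homeostatic plasticity [42, 20].

Related work

There has been a significant amount of prior work towards understanding STDP from the perspectives of machine learning [26, 4, 35]. For example, Nessler et al. show that STDP can be understood as approximating the expectation maximization (EM) algorithm [26]. Nessler et al. study a particularly structured winner-take-all network and its learning rule for maximizing the log likelihood of given static patterns. On the other hand, the DyBM does not assume particular structures in the network, and the learning rule having the properties of STDP applies to any synapse in the network. Also, the learning rule of the DyBM maximizes the log likelihood of given time-series, and its learning rule does not involve approximations beyond what is assumed in stochastic gradient methods [6].

5.2 Giving flexibility to the DyBM^7

^7 This section closely follows [27].

It has been shown in [32] that the DyBM in Section 5.1 has the capability of associative memory and anomaly detection for sequential patterns, but the applications of the DyBM have been limited to simple tasks with relatively low dimensional time-series. In [27], we relax some of the constraints of this DyBM in a way that it gives more flexibility that is useful for learning and inference. Specifically, observe that the first term on the right-hand side of (75) can be rewritten with the definition of $\beta_{i,j}^{[t-1]}$ in (76) as follows:

$$\sum_{i=1}^{N} v_{i,j} \, \beta_{i,j}^{[t-1]} \, x_j^{[t]} = \sum_{i=1}^{N} v_{i,j} \sum_{s=t-d+1}^{t-1} \mu^{s-t} \, x_i^{[s]} \, x_j^{[t]} \tag{80}$$

$$= \sum_{i=1}^{N} \sum_{\delta=1}^{d-1} v_{i,j}^{[\delta]} \, x_i^{[t-\delta]} \, x_j^{[t]}, \tag{81}$$

where we let $v_{i,j}^{[\delta]} \equiv v_{i,j} \, \mu^{-\delta}$. Here, $v_{i,j}^{[\delta]}$ represents how unlikely $j$ fires at time $t$ if $i$ fired at time $t - \delta$. The parametric form of $v_{i,j}^{[\delta]} \equiv v_{i,j} \, \mu^{-\delta}$ assumes that this LTD weight varies geometrically with the interval, $\delta$, between the two spikes. In the following, we relax this constraint on $v_{i,j}^{[\delta]}$ for $\delta = 1, \ldots, d-1$ and assume that these LTD weights can take independent values.

Then the energy of the DyBM with $N$ neurons can be represented conveniently with matrix and vector operations:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) \equiv \sum_{j=1}^{N} E_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}) \tag{82}$$

$$= -\mathbf{b}^\top \mathbf{x}^{[t]} - (\boldsymbol{\alpha}_\lambda^{[t-1]})^\top \mathbf{U} \, \mathbf{x}^{[t]} + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{V}^{[\delta]} \, \mathbf{x}^{[t]} + (\mathbf{x}^{[t]})^\top \mathbf{V} \, \boldsymbol{\gamma}_\mu^{[t-1]}, \tag{83}$$

where $\mathbf{b} \equiv (b_j)_{j=1,\ldots,N}$ is a vector, $\mathbf{U} \equiv (u_{i,j})_{(i,j) \in \{1,\ldots,N\}^2}$ is a matrix, and other boldface letters are defined analogously (a vector is lowercase and a matrix is uppercase). For eligibility traces $\boldsymbol{\alpha}_\lambda^{[t-1]}$ and $\boldsymbol{\gamma}_\mu^{[t-1]}$, we append the subscript to explicitly represent the dependency on the decay rates $\lambda$ and $\mu$.

The functional form of the energy completely determines the dynamics of a DyBM, and relaxing its constraints allows the DyBM to represent a wider class of dynamical systems. Notice that the last term of (83) can be divided into two terms:

$$(\mathbf{x}^{[t]})^\top \mathbf{V} \, \boldsymbol{\gamma}_\mu^{[t-1]} = (\boldsymbol{\gamma}_\mu^{[t-1]})^\top \mathbf{V}^\top \mathbf{x}^{[t]} \tag{84}$$

$$= \mu^d \, (\boldsymbol{\alpha}_\mu^{[t-1]})^\top \mathbf{V}^\top \mathbf{x}^{[t]} + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \hat{\mathbf{V}}^{[\delta]} \, \mathbf{x}^{[t]}, \tag{85}$$

where $\boldsymbol{\alpha}_\mu^{[t-1]}$ is the same as the vector of synaptic eligibility traces but with the decay rate $\mu$, and $\hat{\mathbf{V}}^{[\delta]} \equiv \mu^\delta \, \mathbf{V}^\top$. Comparing (85) and (83), we find that, without loss of generality, the energy of the DyBM can be represented with the following form:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\left( \mathbf{b}^\top + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}^{[\delta]} + \sum_{l=1}^{L} (\boldsymbol{\alpha}_{\lambda_l}^{[t-1]})^\top \mathbf{U}^{[l]} \right) \mathbf{x}^{[t]}, \tag{86}$$

where we define $\mathbf{W}^{[\delta]} \equiv -\mathbf{V}^{[\delta]} - \hat{\mathbf{V}}^{[\delta]}$. The energy in (86) reduces to the original energy in (73) when $\mathbf{W}^{[\delta]} = -\mu^{-\delta} \, \mathbf{V} - \mu^{\delta} \, \mathbf{V}^\top$, $\mathbf{U}^{[1]} = \mathbf{U}$, $\mathbf{U}^{[2]} = -\mu^d \, \mathbf{V}^\top$, $\lambda_1 = \lambda$, $\lambda_2 = \mu$, and $L = 2$. With $L > 2$, one can also incorporate multiple synaptic or neural eligibility traces with varying decay rates, as in [32].

Equivalently, we can represent the energy using neural eligibility traces, $\boldsymbol{\gamma}_{\mu_l}$, instead of synaptic eligibility traces, $\boldsymbol{\alpha}_{\lambda_l}$, as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\left( \mathbf{b}^\top + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}^{[\delta]} - \sum_{l=1}^{L} (\boldsymbol{\gamma}_{\mu_l}^{[t-1]})^\top \mathbf{V}_l \right) \mathbf{x}^{[t]}. \tag{87}$$

Learning rule in vector-matrix notations

The learning rule corresponding to the representation with (86) is as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \left( \mathbf{x}^{[t]} - \mathbb{E}_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right) \tag{88}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \left( \mathbf{x}^{[t]} - \mathbb{E}_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right)^\top \tag{89}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \left( \mathbf{x}^{[t]} - \mathbb{E}_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right)^\top \tag{90}$$

for each $\delta$ and each $l$, where $\mathbb{E}_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}]$ is the conditional expectation with respect to

$$P_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{\exp\left(-E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]})\right)}{\sum_{\tilde{\mathbf{x}}^{[t]}} \exp\left(-E_\theta(\tilde{\mathbf{x}}^{[t]} \mid \mathbf{x}^{[<t]})\right)}. \tag{91}$$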

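Specifically,

$$\mathbb{E}_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] = \frac{\exp(\mathbf{m}^{[t]})}{1 + \exp(\mathbf{m}^{[t]})} \tag{92}$$

with

$$\mathbf{m}^{[t]} \equiv \mathbf{b} + \sum_{\delta=1}^{d-1} (\mathbf{W}^{[\delta]})^\top \mathbf{x}^{[t-\delta]} + \sum_{l=1}^{L} (\mathbf{U}^{[l]})^\top \boldsymbol{\alpha}_{\lambda_l}^{[t-1]}, \tag{93}$$

where exponentiation and division of vectors are elementwise.

The form of (92) implies that the DyBM is a kind of a logit model, where the feature vector, $(\mathbf{x}^{[t-d+1]}, \ldots, \mathbf{x}^{[t-1]}, \boldsymbol{\alpha}_\lambda^{[t-1]}, \boldsymbol{\alpha}_\mu^{[t-1]})$, depends on the prior values, $\mathbf{x}^{[<t]}$, of the time-series. By applying the learning rules given in (88)-(90) to given time-series, we can learn the parameters of the DyBM or equivalently the parameters of the logit model (i.e., $\mathbf{b}$, $\mathbf{W}^{[\delta]}$ for $\delta = 1, \ldots, d-1$, and $\mathbf{U}^{[l]}$ for $l = 1, \ldots, L$) in (93).

5.3 Dynamic Boltzmann machines for real-valued time-series

5.3.1 Gaussian dynamic Boltzmann machines^8

^8 This section closely follows [27].

In this section, we show how a DyBM can deal with real-valued time-series in the form of a Gaussian DyBM [27, 8]. A Gaussian DyBM assumes that $x_j^{[t]}$ follows a Gaussian distribution for each $j$:

$$p_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( -\frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \right), \tag{94}$$

where $m_j^{[t]}$ is given by (93), and $\sigma_j^2$ is a variance parameter. This Gaussian distribution is in contrast to the Bernoulli distribution of the DyBM given by (72).

The conditional energy of the Gaussian DyBM can be represented as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_{j=1}^{N} \frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \tag{95}$$

$$= \sum_{j=1}^{N} \frac{(x_j^{[t]} - b_j)^2}{2 \sigma_j^2} - \sum_{\delta=1}^{d-1} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i^{[t-\delta]} \, \frac{w_{i,j}^{[\delta]}}{\sigma_j^2} \, x_j^{[t]} - \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i,\lambda_l}^{[t-1]} \, \frac{u_{i,j}^{[l]}}{\sigma_j^2} \, x_j^{[t]} + C, \tag{96}$$

where $C$ is the term that does not depend on $\mathbf{x}^{[t]}$. Because $C$ is canceled out between the numerator and the denominator when the energy is normalized into a probability, as in (91), we omit it from the conditional energy. By letting $\mathbf{W}_\sigma^{[\delta]}$ be the matrix whose $(i,j)$ element is $w_{i,j}^{[\delta]} / \sigma_j^2$ and $\mathbf{U}_\sigma^{[l]}$ be the matrix whose $(i,j)$ element is $u_{i,j}^{[l]} / \sigma_j^2$, the conditional energy of the Gaussian DyBM can be represented as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_{j=1}^{N} \frac{(x_j^{[t]} - b_j)^2}{2 \sigma_j^2} - \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}_\sigma^{[\delta]} \, \mathbf{x}^{[t]} - \sum_{l=1}^{L} (\boldsymbol{\alpha}_{\lambda_l}^{[t-1]})^\top \mathbf{U}_\sigma^{[l]} \, \mathbf{x}^{[t]}. \tag{97}$$

The conditional energy of the Gaussian DyBM may be compared against the energy of the Gaussian Bernoulli restricted Boltzmann machine (see [19] or (181) from [28]).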
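The step from (95) to (96) is only an expansion of the square, and can be spelled out for a single unit $j$ as follows. The shorthand $r_j \equiv m_j^{[t]} - b_j$ is introduced here for brevity and does not appear in the source.

$$\frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} = \frac{(x_j^{[t]} - b_j - r_j)^2}{2 \sigma_j^2} = \frac{(x_j^{[t]} - b_j)^2}{2 \sigma_j^2} - \frac{r_j \, x_j^{[t]}}{\sigma_j^2} + \frac{r_j \, b_j}{\sigma_j^2} + \frac{r_j^2}{2 \sigma_j^2}.$$

The last two terms do not depend on $x_j^{[t]}$ and are absorbed into $C$. By (93), $r_j = \sum_{\delta=1}^{d-1} \sum_{i=1}^{N} w_{i,j}^{[\delta]} x_i^{[t-\delta]} + \sum_{l=1}^{L} \sum_{i=1}^{N} u_{i,j}^{[l]} \alpha_{i,\lambda_l}^{[t-1]}$, so the second term, summed over $j$, gives the two cross terms of (96), each weight being divided by $\sigma_j^2$.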

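We now derive a learning rule for the Gaussian DyBM in a way that it maximizes the log-likelihood of given time-series $\mathbf{x}$:

$$\sum_t \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_t \sum_{j=1}^{N} \log p_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}), \tag{98}$$

where the summation over $t$ is over all of the time steps of $\mathbf{x}$, and the conditional independence between $x_i^{[t]}$ and $x_j^{[t]}$ for given $\mathbf{x}^{[<t]}$ is the fundamental property of the DyBM as shown in [32].

The approach of stochastic gradient is to update the parameters of the Gaussian DyBM at each step, $t$, according to the gradient of the conditional probability density of $\mathbf{x}^{[t]}$:

$$\log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\sum_{j=1}^{N} \left( \frac{1}{2} \log \sigma_j^2 + \frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \right) - \frac{N}{2} \log 2\pi, \tag{99}$$

where the equality follows from (94). From (99) and (93), we can derive the derivative with respect to each parameter as follows:

$$\frac{\partial}{\partial b_j} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \tag{100}$$

$$\frac{\partial}{\partial u_{i,j}^{[l]}} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \, \alpha_{i,\lambda_l}^{[t-1]} \tag{101}$$

$$\frac{\partial}{\partial w_{i,j}^{[\delta]}} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \, x_i^{[t-\delta]} \tag{102}$$

$$\frac{\partial}{\partial \sigma_j} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\frac{1}{\sigma_j} + \frac{(x_j^{[t]} - m_j^{[t]})^2}{\sigma_j^3}, \tag{103}$$

where $\delta \in \{1, \ldots, d-1\}$, $l \in \{1, \ldots, L\}$, and $i, j \in \{1, \ldots, N\}$. These parameters are thus updated with learning rate $\eta$ as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \tag{104}$$

$$\boldsymbol{\sigma} \leftarrow \boldsymbol{\sigma} + \eta \, \frac{(\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^2 - \boldsymbol{\sigma}^2}{\boldsymbol{\sigma}^3} \tag{105}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \left( \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \right)^\top \tag{106}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \left( \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \right)^\top, \tag{107}$$

where division and exponentiation of vectors are elementwise, and $\mathbf{m}^{[t]}$ is given by (93).

The maximum likelihood estimator of $\mathbf{x}^{[t]}$ by the Gaussian DyBM is given by $\mathbf{m}^{[t]}$ in (93). The Gaussian DyBM can thus be understood as a modification to the standard vector autoregressive (VAR) model. Specifically, the last term in the right-hand side of (93) involves eligibility traces, which can be understood as features of historical values, $\mathbf{x}^{[-\infty, t-d]}$, and are added as new variables to the VAR model. Because the value of the eligibility traces can depend on the infinite past, the Gaussian DyBM can take into account the history beyond the lag $d - 1$.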
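The updates (104)-(107) can be sketched in Python for the smallest non-trivial case. The sketch assumes a single unit ($N = 1$), delay $d = 2$ and one eligibility trace ($L = 1$), so that (93) reduces to $m^{[t]} = b + w \, x^{[t-1]} + u \, \alpha^{[t-1]}$. The trace recursion is assumed to take the form stated for the functional DyBM in (130), and all function names and numerical values are hypothetical.

```python
def predict_mean(b, w, u, x_prev, alpha):
    """Scalar version of (93): m = b + w * x[t-1] + u * alpha[t-1]."""
    return b + w * x_prev + u * alpha


def sgd_step(b, w, u, sigma, x_prev, alpha, x, eta):
    """One stochastic gradient step following (104)-(107) for N = 1, d = 2, L = 1."""
    m = predict_mean(b, w, u, x_prev, alpha)
    err = (x - m) / sigma ** 2  # common factor in (104), (106) and (107)
    new_b = b + eta * err                                                # (104)
    new_sigma = sigma + eta * ((x - m) ** 2 - sigma ** 2) / sigma ** 3   # (105)
    new_w = w + eta * x_prev * err                                       # (106)
    new_u = u + eta * alpha * err                                        # (107)
    return new_b, new_w, new_u, new_sigma


def update_trace(alpha, x_prev, lam):
    """Assumed trace recursion for d = 2: alpha[t] = lam * alpha[t-1] + x[t-1]."""
    return lam * alpha + x_prev


# One update from zero weights: the prediction is 0 and the observation is 2.
b, w, u, sigma = sgd_step(b=0.0, w=0.0, u=0.0, sigma=1.0,
                          x_prev=1.0, alpha=0.5, x=2.0, eta=0.1)
next_alpha = update_trace(alpha=0.5, x_prev=1.0, lam=0.9)
```

Without the trace term ($u = 0$ and no update of $u$), the same code is a stochastic gradient fit of a first-order autoregressive model, which is the sense in which the eligibility trace acts as an extra regressor summarizing older history.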

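5.3.2 Natural gradients^9

^9 This section closely follows [27].

In this section, we study a learning rule based on natural gradient for the Gaussian DyBM. Consider a stochastic model that gives the probability density of a pattern $x$ as $p_\theta(x)$. With natural gradients [3], the parameters, $\theta$, of the stochastic model are updated as follows:

$$\theta_{t+1} = \theta_t + \eta \, G^{-1}(\theta_t) \, \nabla_\theta \log p_\theta(x) \tag{108}$$

at each step $t$, where $\eta$ is the learning rate, and $G(\theta)$ denotes the Fisher information matrix:

$$G(\theta) \equiv \int p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, \left( \nabla_\theta \log p_\theta(x) \right)^\top dx. \tag{109}$$

Due to the conditional independence in (98), it suffices to derive a natural gradient for each Gaussian unit. Here, we consider the parametrization with mean $m$ and variance $v \equiv \sigma^2$. The probability density function of a Gaussian distribution is represented with this parametrization as follows:

$$p(x; m, v) = \frac{1}{\sqrt{2 \pi v}} \exp\left( -\frac{(x - m)^2}{2 v} \right). \tag{110}$$

The log likelihood of $x$ is then given by

$$\log p(x; m, v) = -\frac{(x - m)^2}{2 v} - \frac{1}{2} \log v - \frac{1}{2} \log 2\pi. \tag{111}$$

Hence, the gradient and the inverse Fisher information matrix in (108) are given as follows:

$$\nabla_\theta \log p_\theta(x) = \begin{pmatrix} \dfrac{x - m}{v} \\[2mm] \dfrac{(x - m)^2}{2 v^2} - \dfrac{1}{2 v} \end{pmatrix} \tag{112}$$

$$G^{-1}(\theta) = \begin{pmatrix} \dfrac{1}{v} & 0 \\ 0 & \dfrac{1}{2 v^2} \end{pmatrix}^{-1} = \begin{pmatrix} v & 0 \\ 0 & 2 v^2 \end{pmatrix}. \tag{113}$$

The parameters $\theta_t \equiv (m_t, v_t)$ are then updated as follows:

$$m_{t+1} = m_t + \eta \, (x - m_t) \tag{114}$$

$$v_{t+1} = v_t + \eta \left( (x - m_t)^2 - v_t \right). \tag{115}$$

In the context of a Gaussian DyBM, the mean is given by (93), where $m_j^{[t]}$ is linear with respect to $b_j$, $w_{i,j}^{[\delta]}$, and $u_{i,j}^{[l]}$. Also, the variance is given by $\sigma_j^2$. Hence, the natural gradient gives the learning rules for these parameters as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]}) \tag{116}$$

$$\boldsymbol{\sigma}^2 \leftarrow \boldsymbol{\sigma}^2 + \eta \left( (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^2 - \boldsymbol{\sigma}^2 \right) \tag{117}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^\top \tag{118}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^\top, \tag{119}$$

where the exponentiation of a vector is elementwise. We can compare against what the standard gradient gives in (104)-(107).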
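As a numerical check of (112)-(115), the following Python sketch applies one standard gradient step and one natural gradient step to a single Gaussian unit parametrized by mean $m$ and variance $v$. The function names and the numbers are illustrative assumptions, not part of [27].

```python
def log_likelihood_gradient(x, m, v):
    """Gradient of log p(x; m, v) with respect to (m, v), as in (112)."""
    return (x - m) / v, (x - m) ** 2 / (2 * v ** 2) - 1 / (2 * v)


def standard_step(x, m, v, eta):
    """Plain stochastic gradient ascent on (m, v)."""
    g_m, g_v = log_likelihood_gradient(x, m, v)
    return m + eta * g_m, v + eta * g_v


def natural_step(x, m, v, eta):
    """Natural gradient ascent: rescale the gradient by the inverse Fisher matrix (113)."""
    g_m, g_v = log_likelihood_gradient(x, m, v)
    return m + eta * v * g_m, v + eta * 2 * v ** 2 * g_v


m_nat, v_nat = natural_step(x=3.0, m=0.0, v=4.0, eta=0.5)
m_std, v_std = standard_step(x=3.0, m=0.0, v=4.0, eta=0.5)
```

The natural step coincides with the closed forms (114) and (115): the mean moves a fraction $\eta$ of the way to the observation regardless of $v$, whereas the standard step is scaled by $1/v$ and therefore becomes small when the variance is large.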

5.3.3 Using nonlinear features in Gaussian DyBMs

The Gaussian DyBM is a linear model and has limited capability in modeling complex time-series. A way to take into account non-linear features of time-series with a Gaussian DyBM is to apply a nonlinear mapping to the input time-series and feed the resulting non-linear features as additional input to the Gaussian DyBM. An example of such a non-linear mapping is an echo state network (ESN) [16]. An ESN maps an input sequence, $\mathbf{x}$, into $\boldsymbol{\psi}$ recursively as follows:

$$\boldsymbol{\psi}^{[t]} = (1 - \rho) \, \boldsymbol{\psi}^{[t-1]} + \rho \, \tanh\left( \mathbf{W}_{\rm rec} \, \boldsymbol{\psi}^{[t-1]} + \mathbf{W}_{\rm in} \, \mathbf{x}^{[t]} \right), \tag{120}$$

where $\mathbf{W}_{\rm rec}$ and $\mathbf{W}_{\rm in}$ are randomly chosen and fixed parameters^10, and $\rho$ is a leak parameter satisfying $0 < \rho < 1$. In (120), $\tanh$ is a hyperbolic tangent function but may be replaced with other nonlinear functions such as a sigmoid function. An eligibility trace may be considered as a linear counterpart of the nonlinear features created by an ESN. Because these features are generated by mappings with fixed parameters and just given as input to a Gaussian DyBM, the learning rules for the Gaussian DyBM stay unchanged. The nonlinear DyBM in [8] uses an ESN in a slightly different manner.

5.4 Functional dynamic Boltzmann machines

We now review a functional DyBM, which models time-series of functions (patterns over a continuous space $\mathcal{Z}$) [17]. Recall that a Gaussian DyBM defines the conditional distribution of the next real-valued vector given the preceding sequence of real-valued vectors. A functional DyBM defines the conditional distribution of the next function (i.e., $g^{[t]}(\cdot)$) given the preceding sequence of partial observations of preceding functions. At each time $s$, a set of points $Z^{[s]} \equiv (z_i^{[s]})_{i=1,\ldots,N_s}$ is observed, where $N_s$ is the number of points that are observed at $s$.
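The functional DyBM assumes that the conditional distribution of $g^{[t]}(\cdot)$ is given by a Gaussian process, whose mean $\mu^{[t]}(\cdot)$ varies over time depending on preceding functions as follows:

$$\mu^{[t]}(z) = b(z) + \sum_{\delta=1}^{d-1} \int_{\mathcal{Z}} w^{[\delta]}(z, z') \, g^{[t-\delta]}(z') \, dz' + \sum_{l=1}^{L} \int_{\mathcal{Z}} u_l(z, z') \, \alpha_l^{[t-1]}(z') \, dz' \tag{121}$$

for $z \in \mathcal{Z}$, where $b(\cdot)$ is a functional bias, $w^{[\delta]}(\cdot, \cdot)$ and $u_l(\cdot, \cdot)$ are functional weights for each $\delta$ and for each $l$, and

$$\alpha_l^{[t-1]}(\cdot) = \sum_{s=-\infty}^{t-d} \lambda_l^{t-s-d} \, g^{[s]}(\cdot) \tag{122}$$

is a functional eligibility trace for each $l$. The covariance $k_{\sigma^2}(\cdot, \cdot)$ of the Gaussian process consists of two components such that

$$k_{\sigma^2}(z, z') = k(z, z') + \sigma^2 \, \delta(z, z'), \tag{123}$$

where $k(\cdot, \cdot)$ is an arbitrary kernel, $\delta(\cdot, \cdot)$ is a delta function, and $\sigma$ is a hyperparameter.

^10 The spectral radius of $\mathbf{W}_{\rm rec}$ is set smaller than 1.

For tractability, Kajino proposes a particular parametrization for the functional bias and functional weights [17]. Let $P = (p_1, \ldots, p_M)$ be a set of arbitrarily selected $M$ points in $\mathcal{Z}$:

$$b(z) = k_{\sigma^2}(z, P) \, \mathbf{b} \tag{124}$$

$$w^{[\delta]}(z, z') = k_{\sigma^2}(z, P) \, \mathbf{W}^{[\delta]} \, k_{\sigma^2}(P, z') \tag{125}$$

$$u_l(z, z') = k_{\sigma^2}(z, P) \, \mathbf{U}^{[l]} \, k_{\sigma^2}(P, z') \tag{126}$$

for each $\delta$ and each $l$, where

$$k_{\sigma^2}(z, P) \equiv \left( k_{\sigma^2}(z, p_i) \right)_{i=1,\ldots,M} \tag{127}$$

is a row vector, and

$$k_{\sigma^2}(P, z) \equiv \left( k_{\sigma^2}(p_i, z) \right)_{i=1,\ldots,M} \tag{128}$$

is a column vector. Because $g^{[t]}$ is in the reproducing kernel Hilbert space with kernel $k_{\sigma^2}$, substituting (124)-(126) into (121) gives the following expression:

$$\mu_\theta^{[t]}(z) = k_{\sigma^2}(z, P) \left( \mathbf{b} + \sum_{\delta=1}^{d-1} \mathbf{W}^{[\delta]} \, g^{[t-\delta]}(P) + \sum_{l=1}^{L} \mathbf{U}^{[l]} \, \alpha_l^{[t-1]}(P) \right), \tag{129}$$

where $g^{[t-\delta]}(P)$ is a column vector with $i$-th element being $g^{[t-\delta]}(p_i)$, and the eligibility-trace vector $\alpha_l^{[t-1]}(P)$ can be recursively updated as follows:

$$\alpha_l^{[t]}(P) = \lambda_l \, \alpha_l^{[t-1]}(P) + g^{[t-d+1]}(P). \tag{130}$$

Here, we use $\theta \equiv (\mathbf{b}, \mathbf{W}^{[1]}, \ldots, \mathbf{W}^{[d-1]}, \mathbf{U}^{[1]}, \ldots, \mathbf{U}^{[L]})$ to collectively denote the parameters.

While $g^{[s]}(p_i)$ for $i \in [1, M]$ is not observed, Kajino uses a maximum a posteriori (MAP) estimator $\hat{g}^{[s]}(p_i)$ in [17]:

$$\hat{g}^{[s]}(p_i) = \mu_\theta^{[s]}(p_i) + k(p_i, Z^{[s]}) \, k_{\sigma^2}(Z^{[s]}, Z^{[s]})^{-1} \left( g^{[s]}(Z^{[s]}) - \mu_\theta^{[s]}(Z^{[s]}) \right), \tag{131}$$

where $k_{\sigma^2}(Z^{[s]}, Z^{[s]})$ is an $N_s \times N_s$ matrix with $(i,j)$-th element being $k_{\sigma^2}(z_i^{[s]}, z_j^{[s]})$, and $\mu_\theta^{[s]}(Z^{[s]})$ is a column vector defined analogously to $g^{[s]}(Z^{[s]})$.

The objective of learning a functional DyBM is to maximize the log likelihood of observed values. The conditional probability density of the functional values of locations $Z^{[t]}$ at time $t$ is given by

$$p_\theta(g^{[t]}(Z^{[t]}) \mid g^{[<t]}) \propto \exp\left( -\frac{1}{2} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right)^\top k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \right). \tag{132}$$

The objective is thus to maximize

$$f(\theta) \equiv \sum_t f_t(\theta), \tag{133}$$

where

$$f_t(\theta) \equiv \log p_\theta(g^{[t]}(Z^{[t]}) \mid g^{[<t]}) \tag{134}$$

$$= -\frac{1}{2} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right)^\top k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) + C, \tag{135}$$

where $C$ is the term independent of $\theta$. The gradient of $f_t(\theta)$ is given by

$$\nabla f_t(\theta) = \nabla \mu_\theta^{[t]}(Z^{[t]}) \; k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right), \tag{136}$$

where (129) gives

$$\frac{\partial}{\partial b_i} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \tag{137}$$

$$\frac{\partial}{\partial w_{i,j}^{[\delta]}} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \, g^{[t-\delta]}(p_j) \tag{138}$$

$$\frac{\partial}{\partial u_{i,j}^{[l]}} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \, \alpha_l^{[t-1]}(p_j) \tag{139}$$

for each $i$, $j$, $\delta$, $l$. The gradient implies the following learning rule with stochastic gradient:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \tag{140}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) g^{[t-\delta]}(P)^\top \tag{141}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \alpha_l^{[t-1]}(P)^\top, \tag{142}$$

where $\eta$ is a learning rate, and $g^{[t-\delta]}(P)$ is estimated with the MAP estimator $\hat{g}^{[t-\delta]}(P)$ in (131).

5.5 Dynamic Boltzmann machines with hidden units^11

^11 This section closely follows [30].

In this section, we study a DyBM with hidden units (see Figure 10). Each layer of this DyBM corresponds to a time $t - \delta$ for $0 \le \delta < \infty$ and has two parts: visible and hidden. The visible part $\mathbf{x}^{[t-\delta]}$ at the $\delta$-th layer represents the values of the time-series at time $t - \delta$. The hidden part $\mathbf{h}^{[t-\delta]}$ represents the values of hidden units at time $t - \delta$. Here, units within each layer do not have connections to each other. We let $\mathbf{x}^{[<t]} \equiv (\mathbf{x}^{[s]})_{s<t}$ and define $\mathbf{h}^{[<t]}$ analogously.

[Figure 10: A dynamic Boltzmann machine with hidden units (modified Figure 1 from [30]).]

The Boltzmann machine in Figure 10 has bias parameter $\mathbf{b}$ and weight parameter $(\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{Z})$. Let $\theta \equiv (\mathbf{V}, \mathbf{W}, \mathbf{b})$ be the parameters connected to visible units $\mathbf{x}^{[t]}$ (from the units in the past, $\mathbf{x}^{[s]}$ or $\mathbf{h}^{[s]}$ for $s < t$) and $\phi \equiv (\mathbf{U}, \mathbf{Z})$. The conditional energy of this Boltzmann machine is given as follows:

$$E_{\theta,\phi}(\mathbf{x}^{[t]}, \mathbf{h}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}) = E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}) + E_\phi(\mathbf{h}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}), \tag{143}$$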
