arXiv: v3 [cs.NE] 18 Jan 2019


Boltzmann machines for time-series

Takayuki Osogami
IBM Research - Tokyo
osogami@jp.ibm.com

Abstract

We review Boltzmann machines extended for time-series. These models often have recurrent structure, and back propagation through time (BPTT) is used to learn their parameters. The per-step computational complexity of BPTT in online learning, however, grows linearly with respect to the length of the preceding time-series (i.e., the learning rule is not local in time), which limits the applicability of BPTT in online learning. We then review dynamic Boltzmann machines (DyBMs), whose learning rule is local in time. The DyBM's learning rule relates to spike-timing dependent plasticity (STDP), which has been postulated and experimentally confirmed for biological neural networks.

1 Introduction

The Boltzmann machine is a stochastic model for representing probability distributions over binary patterns [28]. In this paper, we review Boltzmann machines that have been studied as stochastic (generative) models of time-series. Such Boltzmann machines define probability distributions over time-series of binary patterns. They can also be modified to deal with time-series of real-valued patterns, similar to Boltzmann machines modified for real-valued patterns (e.g., Gaussian Boltzmann machines; see Section 6.3 from [28]). We will follow the probabilistic representations of [28] for intuitive interpretations in terms of probabilities.

In Section 3, we start with a Conditional Restricted Boltzmann Machine (CRBM) [40], which is a conditional Boltzmann machine (Section 4 from [28]) that gives the conditional probability of the next pattern given a fixed number of preceding patterns. A limitation of a CRBM is that it can take into account only the dependency within a fixed horizon, and as we increase the length of this horizon, the complexity of learning grows accordingly.

To overcome this limitation of CRBMs, researchers have proposed Boltzmann machines having recurrent structures, which we review in Section 4.
These include spiking Boltzmann machines [12], temporal restricted Boltzmann machines (TRBMs) [37], recurrent temporal restricted Boltzmann machines (RTRBMs) [38], and extensions of those models.

A standard approach to learning those models having recurrent structures is back propagation through time (BPTT). However, BPTT is undesirable when we learn time-series in an online manner, where we update the parameters of a model every time a new pattern arrives. Such online learning is needed when we want to quickly adapt to a changing environment or when we do not have sufficient memory to store the time-series. Unfortunately, the per-step computational complexity of BPTT in online learning grows linearly with respect to the length of the preceding time-series. This computational complexity limits the applicability of BPTT to online learning.

In Section 5, we review the dynamic Boltzmann machine (DyBM) [32, 31] and its extensions. The DyBM's per-step computational complexity in online learning is independent of the length of the preceding time-series. We discuss how the learning rule of the DyBM relates to spike-timing dependent plasticity (STDP), which has been postulated and experimentally confirmed for biological neural networks.

This survey paper is based on a personal note prepared for the third of the four parts of a tutorial given at the 26th International Joint Conference on Artificial Intelligence (IJCAI-17) held in Melbourne, Australia, on August 21. See the tutorial webpage (footnote 1) for information about the tutorial. A survey corresponding to the first part of the tutorial (Boltzmann machines and energy-based models) can be found in [28]. We follow the definitions and notations used in [28].

2 Learning energy-based models for time-series

Consider a (possibly multi-dimensional) time-series:

$$x \equiv (x^{[t]})_{t=0}^{T}, \tag{1}$$

where $x^{[t]}$ denotes the binary pattern (vector) at time $t$. We will use $x^{[s,t]}$ to denote the time-series of the patterns from time $s$ to $t$. A goal of learning time-series is to maximize the log-likelihood of a given time-series $x$ (or a collection of multiple time-series) with respect to the distribution $P_\theta(\cdot)$ defined by a model under consideration, where we use $\theta$ to denote the set of the parameters of the model:

$$f(\theta) \equiv \log P_\theta(x) = \sum_{t=0}^{T} \log P_\theta(x^{[t]} \mid x^{[0,t-1]}), \tag{2}$$

where $P_\theta(x^{[t]} \mid x^{[0,t-1]})$ denotes the conditional probability that the pattern at time $t$ is $x^{[t]}$ given that the patterns up to time $t-1$ are $x^{[0,t-1]}$. Here, $P_\theta(x^{[0]} \mid x^{[0,-1]})$ denotes the probability that the pattern at time 0 is $x^{[0]}$, where $x^{[0,-1]}$ should be interpreted as an empty history.

We study models where the probability is represented with energy $E_\theta(\cdot)$ as follows:

$$P_\theta(x) = \sum_{\tilde h} P_\theta(x, \tilde h), \tag{3}$$

where

$$P_\theta(x, h) = \frac{\exp(-E_\theta(x, h))}{\sum_{\tilde x} \sum_{\tilde h} \exp(-E_\theta(\tilde x, \tilde h))}, \tag{4}$$

the summation with respect to $\tilde x$ is over all of the possible binary time-series of length $T$, and the summation with respect to $\tilde h$ is over all of the possible hidden values.
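As a hedged illustration of the decomposition in (2), the sketch below evaluates the log-likelihood of a binary time-series as a sum of conditional log-probabilities. The conditional model used here (independent Bernoulli units whose logits depend linearly on the previous pattern only) is an assumption chosen to keep the example small, and the names `cond_log_prob`, `log_likelihood`, `bias`, and `weight` are hypothetical; none of them come from the paper.

```python
import itertools
import math


def cond_log_prob(x_t, history, bias, weight):
    """log P(x[t] | x[0,t-1]) for a toy first-order model (an assumption).

    Unit j fires with probability sigmoid(bias[j] + sum_i weight[i][j] * prev[i]),
    where prev is the last pattern in the history (all zeros for an empty history).
    """
    n = len(bias)
    prev = history[-1] if history else [0] * n
    total = 0.0
    for j in range(n):
        logit = bias[j] + sum(weight[i][j] * prev[i] for i in range(n))
        p_fire = 1.0 / (1.0 + math.exp(-logit))
        total += math.log(p_fire if x_t[j] == 1 else 1.0 - p_fire)
    return total


def log_likelihood(series, bias, weight):
    """f(theta) = sum_t log P(x[t] | x[0,t-1]), as in (2)."""
    return sum(cond_log_prob(series[t], series[:t], bias, weight)
               for t in range(len(series)))


bias = [0.1, -0.3]
weight = [[0.5, -1.0], [0.2, 0.7]]

# Sanity check of the chain rule: the probabilities of all binary
# time-series of a fixed length (here 3 steps of 2 units) sum to one.
patterns = [list(p) for p in itertools.product([0, 1], repeat=2)]
total_prob = sum(math.exp(log_likelihood(list(s), bias, weight))
                 for s in itertools.product(patterns, repeat=3))
```

Because every conditional distribution is normalized, `total_prob` equals one up to rounding, which is the property that lets (2) be maximized one time step at a time.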
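The gradient of $f(\theta)$ can then be represented as follows (see (73)):

$$\nabla f(\theta) = -\mathbb{E}_{\rm target}\big[\mathbb{E}_\theta[\nabla E_\theta(X, H) \mid X]\big] + \mathbb{E}_\theta[\nabla E_\theta(X, H)], \tag{5}$$

where $X$ represents the random time-series, $H$ represents the random hidden values, $\mathbb{E}_\theta$ denotes the expectation with respect to the model distribution $P_\theta$, and $\mathbb{E}_{\rm target}$ denotes the expectation with respect to the target distribution, which in our case is the empirical distribution of time-series. When a single time-series $x$ is given as the target, (5) reduces to

$$\nabla f(\theta) = -\mathbb{E}_\theta[\nabla E_\theta(x, H)] + \mathbb{E}_\theta[\nabla E_\theta(X, H)], \tag{6}$$

where the first expectation is taken over the hidden values $H$ given $x$.

Footnote 1. group.php?d=7834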

In other words, $f(\theta)$ can be maximized by maximizing the sum of

$$f_t(\theta) \equiv \log P_\theta(x^{[t]} \mid x^{[0,t-1]}), \tag{7}$$

where

$$P_\theta(x^{[t]} \mid x^{[0,t-1]}) = \sum_{\tilde h} P_\theta(x^{[t]}, \tilde h \mid x^{[0,t-1]}), \tag{8}$$

$$P_\theta(x^{[t]}, h \mid x^{[0,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h \mid x^{[0,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h} \exp(-E_\theta(\tilde x^{[t]}, \tilde h \mid x^{[0,t-1]}))}, \tag{9}$$

and $E_\theta(x^{[t]}, h \mid x^{[0,t-1]})$ is the conditional energy of $(x^{[t]}, h)$ given $x^{[0,t-1]}$.

The gradient of $f_t(\theta)$ is given analogously to that of $f(\theta)$:

$$\nabla f_t(\theta) = -\mathbb{E}_{\rm target}\Big[\mathbb{E}_\theta\big[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]}) \,\big|\, X^{[t]}\big]\Big] + \mathbb{E}_\theta\big[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]})\big]. \tag{10}$$

When the target is a single time-series $x$, we have

$$\nabla f_t(\theta) = -\mathbb{E}_\theta\big[\nabla E_\theta(x^{[t]}, H \mid x^{[0,t-1]})\big] + \mathbb{E}_\theta\big[\nabla E_\theta(X^{[t]}, H \mid x^{[0,t-1]})\big]. \tag{11}$$

3 Non-recurrent Boltzmann machines for time-series

By (2), any model that can represent the conditional probability $P_\theta(x^{[t]} \mid x^{[0,t-1]})$ can be used for time-series. In this section, we start with a Boltzmann machine that can be used to model a $D$-th order Markov model for an arbitrarily determined $D$. In $D$-th order Markov models, the conditional probability can be represented as

$$P_\theta(x^{[t]} \mid x^{[0,t-1]}) = P_\theta(x^{[t]} \mid x^{[t-D,t-1]}). \tag{12}$$

3.1 Conditional restricted Boltzmann machines

Figure 1a shows a particularly structured Boltzmann machine called Conditional Restricted Boltzmann Machine (CRBM) [40]. A CRBM represents the conditional probability on the right-hand side of (12). In the figure, we set $D = 2$.

The CRBM consists of $D + 1$ layers of visible units and a layer of hidden units. The units within each layer have no connections, but units between different layers may be connected to each other. Each visible layer corresponds to a pattern at a time $s \in [t-D, t]$. The CRBM is a conditional Boltzmann machine shown in Figure 2c from [28] but with a particular structure to represent time-series. The visible layers corresponding to $x^{[t-D,t-1]}$ are the input, and the visible layer corresponding to $x^{[t]}$ is the output. The parameters $\theta$ of the CRBM are independent of $t$.

More formally, the energy of a CRBM is given by

$$E_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}) = -(b^{\rm V})^\top x^{[t]} - (b^{\rm H})^\top h - h^\top W^{\rm HV} x^{[t]} - \sum_{d=1}^{D} (x^{[t-d]})^\top W^{[d]} x^{[t]}, \tag{13}$$

where $x^{[t]}$ is the output, $x^{[t-D,t-1]}$ is the input, and $h$ is hidden. We can then represent the conditional probability as follows (see (14) from [28]):

$$P_\theta(x^{[t]} \mid x^{[t-D,t-1]}) = \sum_{\tilde h} P_\theta(x^{[t]}, \tilde h \mid x^{[t-D,t-1]}), \tag{14}$$

Figure 1: Conditional restricted Boltzmann machines. (a) Single hidden layer. (b) Multiple hidden layers.

where

$$P_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h \mid x^{[t-D,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h} \exp(-E_\theta(\tilde x^{[t]}, \tilde h \mid x^{[t-D,t-1]}))}, \tag{15}$$

and the summation with respect to $\tilde h$ is over all of the possible binary hidden patterns, and the summation with respect to $\tilde x^{[t]}$ is defined analogously. One can then learn the parameters $\theta = (b^{\rm V}, b^{\rm H}, W^{\rm HV}, W^{[1]}, \ldots, W^{[D]})$ of the model by following a gradient-based method in Section 4 from [28].

3.2 Extensions of conditional restricted Boltzmann machines

The CRBM has been extended in various ways. Taylor et al. study a CRBM with multiple layers of hidden units [40] (see Figure 1b). Memisevic and Hinton study a CRBM extended with three-way interactions (i.e., a higher order Boltzmann machine), which they refer to as a gated CRBM [24]. Specifically, the energy of the gated CRBM involves

$$\sum_{i,j,k} w_{i,j,k}\, x_i\, y_j\, h_k, \tag{16}$$

where $x$ denotes input values, $y$ denotes output values, and $h$ denotes hidden values. A drawback of the gated CRBM is its increased number of parameters due to the three-way interactions. Taylor and Hinton study a factored CRBM, where the three-way interaction is represented with a reduced number of parameters as follows [39]:

$$\sum_{f} \sum_{i,j,k} w^{v}_{i,f}\, w^{y}_{j,f}\, w^{h}_{k,f}\, x_i\, y_j\, h_k, \tag{17}$$

where the summation with respect to $f$ is over a set of factors under consideration.
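The relation between the full three-way term (16) and its factored form (17) can be checked numerically. The sketch below is a minimal illustration under assumed sizes (4 inputs, 4 outputs, 4 hidden units, 2 factors) and arbitrary weights; the function names and the arrays `wv`, `wy`, `wh` are hypothetical and stand for $w^{v}$, $w^{y}$, $w^{h}$.

```python
import itertools


def gated_term(w, x, y, h):
    """sum_{i,j,k} w[i][j][k] * x[i] * y[j] * h[k], the three-way term in (16)."""
    return sum(w[i][j][k] * x[i] * y[j] * h[k]
               for i, j, k in itertools.product(
                   range(len(x)), range(len(y)), range(len(h))))


def factored_term(wv, wy, wh, x, y, h):
    """Factored form (17), computed as a sum over factors of three projections."""
    total = 0.0
    for f in range(len(wv[0])):
        fx = sum(wv[i][f] * x[i] for i in range(len(x)))
        fy = sum(wy[j][f] * y[j] for j in range(len(y)))
        fh = sum(wh[k][f] * h[k] for k in range(len(h)))
        total += fx * fy * fh
    return total


def expand_factors(wv, wy, wh):
    """Full tensor w[i][j][k] = sum_f wv[i][f] * wy[j][f] * wh[k][f]."""
    n_factors = len(wv[0])
    return [[[sum(wv[i][f] * wy[j][f] * wh[k][f] for f in range(n_factors))
              for k in range(len(wh))]
             for j in range(len(wy))]
            for i in range(len(wv))]


n_units, n_factors = 4, 2
# Arbitrary illustrative weights (not learned, not from the paper).
wv = [[0.1 * (i + 1) * (f + 1) for f in range(n_factors)] for i in range(n_units)]
wy = [[0.2 * (j - 1) + 0.3 * f for f in range(n_factors)] for j in range(n_units)]
wh = [[0.5 - 0.1 * k * (f + 1) for f in range(n_factors)] for k in range(n_units)]
x, y, h = [1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1]

full = expand_factors(wv, wy, wh)
n_gated = n_units ** 3               # parameters of the full tensor
n_factored = 3 * n_units * n_factors  # parameters of the factored form
```

With these sizes the full tensor has 64 entries while the factored form has 24 parameters, and both give the same value of the interaction term. The saving grows with the number of units as long as the number of factors stays small.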

Figure 2: A spiking Boltzmann machine studied in [12]. (a) Structure of a spiking Boltzmann machine. (b) $r$ used in [12]. In (b), we use the figure in the version available at frtz/absps/nps00-ab.pdf.

4 Boltzmann machines for time-series with recurrent structures

4.1 Spiking Boltzmann machines

A spiking Boltzmann machine studied in [12] can be shown to be essentially equivalent to the Boltzmann machine illustrated in Figure 2. This Boltzmann machine consists of input units, output units, and hidden units. The input units represent historical values of visible units and hidden units. Although hidden units are random and cannot be simply given as input, Hinton and Brown make the approximation of using sampled values $H^{(-\infty,t-1]}(\omega)$ as the input hidden units [12].

Specifically, given the visible values $x^{[<t]} \equiv x^{(-\infty,t-1]}$ and sampled hidden values $H^{[<t]}(\omega)$ up to time $t-1$, the energy with the visible values $x^{[t]}$ and hidden values $h^{[t]}$ at time $t$ can be represented as follows:

$$\begin{aligned}
E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) ={}& -(b^{\rm V})^\top x^{[t]} - (b^{\rm H})^\top h^{[t]} - (h^{[t]})^\top r(0)\, W^{\rm HV} x^{[t]} \\
& - \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{\rm HH} h^{[t]} - \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{\rm VH} h^{[t]} \\
& - \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{\rm HV} x^{[t]} - \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{\rm VV} x^{[t]},
\end{aligned} \tag{18}$$

where $r$ is an arbitrarily chosen function and is not the target of learning. Namely, the Boltzmann machine has an infinite number of units but can be characterized by a finite number of parameters $\theta \equiv (b^{\rm V}, b^{\rm H}, W^{\rm VV}, W^{\rm VH}, W^{\rm HV}, W^{\rm HH})$. Figure 2b shows the specific $r$ used in [12] (footnote 2).

Notice that the Boltzmann machine in Figure 2 can be seen as a restricted Boltzmann machine (RBM) whose bias and weight can depend on $x^{[<t]}$ and $H^{[<t]}(\omega)$, because (18) can be represented as

$$E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = -b^{\rm H}(t, \omega)^\top h^{[t]} - b^{\rm V}(t, \omega)^\top x^{[t]} - (h^{[t]})^\top W x^{[t]}, \tag{19}$$

Footnote 2. Although it is not clear from the descriptions in [12], the labels in the horizontal axis should probably be shifted by one, so that $r(0) = 0$, $r(1) \approx 0.1$, and so on.

where $b^{\rm H}(t, \omega)$ is the time-varying bias for hidden units, $b^{\rm V}(t, \omega)$ is the time-varying bias for visible units, and $W$ is the weight between visible units and hidden units:

$$b^{\rm H}(t, \omega)^\top \equiv (b^{\rm H})^\top + \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{\rm HH} + \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{\rm VH} \tag{20}$$

$$b^{\rm V}(t, \omega)^\top \equiv (b^{\rm V})^\top + \sum_{\tau=1}^{\infty} (H^{[t-\tau]}(\omega))^\top r(\tau)\, W^{\rm HV} + \sum_{\tau=1}^{\infty} (x^{[t-\tau]})^\top r(\tau)\, W^{\rm VV} \tag{21}$$

$$W \equiv r(0)\, W^{\rm HV}. \tag{22}$$

We can then represent the conditional probability as follows:

$$P_\theta(x^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = \sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)), \tag{23}$$

where

$$P_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde x^{[t]}, \tilde h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)))}. \tag{24}$$

We now discuss the choice of $r(0) = 0$, which appears to be the case in Figure 2b. In this case, the energy is reduced to

$$E_\theta(x^{[t]}, h^{[t]} \mid x^{[<t]}, H^{[<t]}(\omega)) = -b^{\rm H}(t, \omega)^\top h^{[t]} - b^{\rm V}(t, \omega)^\top x^{[t]}. \tag{25}$$

Because there are no connections between visible units and hidden units at time $t$, the hidden values at $t$ do not affect the distribution of the visible values at $t$. The only role of the hidden units is that the sampled hidden values are used to update the time-varying biases, $b^{\rm V}(s, \omega)$ and $b^{\rm H}(s, \omega)$, for $s > t$. A problem is that there is no mechanism that allows us to learn appropriate values of $W^{\rm VH}$ and $W^{\rm HH}$ until we observe succeeding visible values. Namely, the hidden values $h^{[t]}$ are sampled with the dependency on $W^{\rm VH}$ and $W^{\rm HH}$, but whether the sampled hidden values are good or not can only be known when those hidden values are used as input. This helps us to learn appropriate values of $W^{\rm HV}$, but not $W^{\rm VH}$ or $W^{\rm HH}$. See [30] for further discussion.

4.2 Temporal restricted Boltzmann machines

Sutskever and Hinton study a model related to a CRBM, which they refer to as a temporal restricted Boltzmann machine (TRBM) [37]. While a CRBM defines the conditional distribution of the visible and hidden values at time $t$ given only the visible values from time $t-D$ to $t-1$, a TRBM defines the corresponding conditional probability given both the visible values and the hidden values from time $t-D$ to $t-1$. See Figure 3a. Similar to the CRBM, the parameters $\theta$ of the TRBM do not depend on time $t$.

Unlike the CRBM, the TRBM is not a conditional RBM.
This is because a TRBM with a single set of parameters is used for every $t$, and the distribution of hidden values is shared among those TRBMs at varying $t$. In particular, hidden values of the TRBM can depend on the future visible values.

Because this dependency makes learning and inference hard, it is ignored in [37]. Namely, the values at each time $t$ are assumed to be conditionally independent of the values after time $t$ given the values at and before time $t$. In particular, the distribution of the hidden values at time $t$ is completely determined by the visible values up to time $t$. The distribution of the hidden values before time $t$ can thus be considered as input when we use the TRBM to define the conditional distribution of the values at time $t$ (see Figure 3b). For each set of sampled values of the hidden units, the TRBM in Figure 3b is a CRBM.

Figure 3: Temporal restricted Boltzmann machines. (a) TRBM. (b) TRBM with approximations in [37]. In (b), the gray circles indicate that expected values are used for the input hidden units.

Furthermore, in [37], the expected values (see Section 5.4 from [28]) are used for the hidden values. Then the input hidden units in Figure 3b take real values in $[0, 1]$ that are completely determined by the visible values before time $t$.

More formally, with the approximations in [37], the TRBM with parameter $\theta$ defines the probability distribution over the time-series of visible and hidden values as follows:

$$P_\theta(x) = \prod_{t=0}^{T} \sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}), \tag{26}$$

where

$$P_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}))}{\sum_{\tilde x^{[t]}} \sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]}))} \tag{27}$$

is the conditional distribution defined by the Boltzmann machine shown in Figure 3b, where $r^{[t-D,t-1]}$ are expected hidden values. Specifically, (27) is used to compute

$$r^{[t]} = \mathbb{E}_\theta[H^{[t]} \mid x^{[0,t]}], \tag{28}$$

which is subsequently used with (27) for $t \leftarrow t + 1$, where the expectation in (28) is with respect to

$$P_\theta(h^{[t]} \mid x^{[0,t]}) = \frac{P_\theta(x^{[t]}, h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]})}{\sum_{\tilde h^{[t]}} P_\theta(x^{[t]}, \tilde h^{[t]} \mid x^{[t-D,t-1]}, r^{[t-D,t-1]})}. \tag{29}$$

Notice that $r^{[t]}$ can be computed from $x^{[0,t]}$ in a deterministic manner with dependency on $\theta$. However, this dependency on $\theta$ is ignored in learning TRBMs.

4.3 Recurrent temporal restricted Boltzmann machines

To overcome the intractability of the TRBM without approximations, Sutskever et al. study a refined model of the TRBM, which they refer to as a recurrent temporal restricted Boltzmann machine (RTRBM) [38]. The RTRBM simplifies the TRBM by removing connections between visible layers and connections between hidden layers that are separated by more than one lag. This means that the visible and hidden values at time $t$ are conditionally independent of the visible values before time $t$ and the hidden values before time $t-1$ given the hidden values at time $t-1$. Similar to the approximation made for the TRBM in Figure 3b, the RTRBM uses the expected values for the hidden values at time $t-1$ but defines the conditional distribution of the visible and hidden values at time $t$ over their binary values. See Figure 4.

Figure 4: A recurrent temporal restricted Boltzmann machine.

More formally, let $r^{[t-1]}$ denote the expected values of the hidden units at time $t-1$:

$$r^{[t-1]} \equiv \mathbb{E}_\theta[H^{[t-1]} \mid x^{[0,t-1]}], \tag{30}$$

where $H^{[t-1]}$ is the random vector representing the hidden values at time $t-1$, and $\mathbb{E}_\theta[\,\cdot \mid x^{[0,t-1]}]$ represents the conditional expectation given the visible values up to time $t-1$. The probability distribution of the values at time $t$ is then given by

$$P_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}) = \frac{\exp(-E_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}))}{\sum_{\tilde x, \tilde h} \exp(-E_\theta(\tilde x, \tilde h \mid r^{[t-1]}))}, \tag{31}$$

where the conditional energy is given by

$$E_\theta(x^{[t]}, h^{[t]} \mid r^{[t-1]}) \equiv -(b^{\rm V})^\top x^{[t]} - (b^{\rm H})^\top h^{[t]} - (r^{[t-1]})^\top U h^{[t]} - (x^{[t]})^\top W h^{[t]} \tag{32}$$

for parameters

$$\theta \equiv (b^{\rm V}, b^{\rm H}, W, U), \tag{33}$$

where $b^{\rm V}$ is the bias for visible units, $b^{\rm H}$ is the bias for hidden units, $W$ is the weight matrix between visible units and hidden units, and $U$ is the weight matrix between the previous expected values of hidden units at $t-1$ and the hidden units at $t$.

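4.3.1 Inference

The marginal conditional distribution of visible values at time $t \ge 1$ can then be represented as follows:

$$P_\theta(x^{[t]} \mid r^{[t-1]}) = \frac{\exp(-F_\theta(x^{[t]} \mid r^{[t-1]}))}{\sum_{\tilde x^{[t]}} \exp(-F_\theta(\tilde x^{[t]} \mid r^{[t-1]}))}, \tag{34}$$

where the conditional free-energy is given by

$$F_\theta(x^{[t]} \mid r^{[t-1]}) \equiv -\log \sum_{\tilde h^{[t]}} \exp(-E_\theta(x^{[t]}, \tilde h^{[t]} \mid r^{[t-1]})). \tag{35}$$

Once the visible values at time $t$ are given, the conditional probability distribution of the hidden values at time $t$ can be represented as follows:

$$P_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) = \frac{\exp(-E_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}))}{\sum_{\tilde h^{[t]}} \exp(-E_\theta(\tilde h^{[t]} \mid r^{[t-1]}, x^{[t]}))}, \tag{36}$$

where the $(b^{\rm V})^\top x^{[t]}$ term is canceled out between the numerator and the denominator, and the conditional energy is given by

$$E_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) \equiv -(b^{\rm H} + U^\top r^{[t-1]} + W^\top x^{[t]})^\top h^{[t]}. \tag{37}$$

By Corollary 1 from [28], the hidden values at time $t$ are conditionally independent of each other given $r^{[t-1]}$ and $x^{[t]}$:

$$P_\theta(h^{[t]} \mid r^{[t-1]}, x^{[t]}) = \prod_j P_\theta(h_j^{[t]} \mid r^{[t-1]}, x^{[t]}), \tag{38}$$

where

$$P_\theta(h_j^{[t]} \mid r^{[t-1]}, x^{[t]}) = \frac{\exp(b_j^{[t]} h_j^{[t]})}{1 + \exp(b_j^{[t]})}, \tag{39}$$

where $b_j^{[t]}$ is the $j$-th element of

$$b^{[t]} \equiv b^{\rm H} + U^\top r^{[t-1]} + W^\top x^{[t]} \tag{40}$$

for $t \ge 1$, and

$$b^{[0]} \equiv b^{\rm init} + W^\top x^{[0]}, \tag{41}$$

where we now follow [38] and allow the hidden units at time 0 to have their own bias $b^{\rm init}$ that can differ from $b^{\rm H}$. The expected values are thus given by

$$r^{[t]} = \frac{1}{1 + \exp(-b^{[t]})}, \tag{42}$$

where the operations are defined elementwise.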
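The deterministic recursion for the expected hidden values can be sketched as follows. This is a minimal illustration, assuming the index convention of the energy (32), in which `W[v][j]` couples visible unit `v` with hidden unit `j` and `U[i][j]` couples the previous expected value of hidden unit `i` with hidden unit `j`; the function name `expected_hidden` and the plain-list representation are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_hidden(xs, b_init, b_h, U, W):
    """Return [r[0], ..., r[T]] computed as in (40)-(42).

    xs[t] is the visible pattern at time t. At t = 0 the bias b_init is used
    and there is no recurrent input; for t >= 1 the bias b_h is used together
    with the previous expected hidden values r[t-1].
    """
    n_hidden = len(b_h)
    rs = []
    for t, x in enumerate(xs):
        r_t = []
        for j in range(n_hidden):
            b = b_init[j] if t == 0 else b_h[j]
            b += sum(W[v][j] * x[v] for v in range(len(x)))
            if t > 0:
                b += sum(U[i][j] * rs[-1][i] for i in range(n_hidden))
            r_t.append(sigmoid(b))
        rs.append(r_t)
    return rs


# Tiny example: two visible units, one hidden unit, two time steps.
rs_example = expected_hidden(
    xs=[[1, 0], [0, 1]], b_init=[0.0], b_h=[-1.0], U=[[0.5]], W=[[1.0], [2.0]])
```

Each step costs a fixed amount of work given `r[t-1]`, so inference runs forward in time in the same way as the hidden state of a recurrent neural network.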

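Figure 5: A recurrent temporal restricted Boltzmann machine unfolded through time.

4.3.2 Learning

The parameters of an RTRBM can be trained through back propagation through time, analogous to recurrent neural networks, but with contrastive divergence.

To understand how we can train RTRBMs, Figure 5 shows an RTRBM unfolded through time. Recall that the expected values of hidden units are deterministically updated from $r^{[t-1]}$ to $r^{[t]}$ according to (40) and (42). Hence, $r^{[t]}$ can be understood as hidden values of a recurrent neural network (RNN) [34]. An RTRBM can then be seen as an RNN but gives an RBM as an output instead of real values, which would be given as an output from the standard RNN.

We will derive the learning rule of the RTRBM, closely following [25] but using our notations. When an RTRBM is unfolded through time, its energy can be represented as follows:

$$E_\theta(x, h) = -\sum_{t=0}^{T} (b^{\rm V})^\top x^{[t]} - (b^{\rm init})^\top h^{[0]} - \sum_{t=1}^{T} (b^{\rm H})^\top h^{[t]} - \sum_{t=0}^{T} (x^{[t]})^\top W h^{[t]} - \sum_{t=1}^{T} (r^{[t-1]})^\top U h^{[t]}. \tag{43}$$

By (5), we can maximize the log-likelihood of a given time-series $x$ with a gradient-based approach. What we need in (5) is the gradient of the energy with respect to the parameters. A caveat is that the energy in (43) depends on $r^{[\cdot]}$, which in turn depends on $\theta$ in a recursive manner. Also, the expectation with respect to $P_\theta$ in (5) needs to be computed with an approximation such as contrastive divergence (see Section 5.2).

We first study the last term of (43). Let

$$Q_s \equiv \sum_{t=s}^{T} (r^{[t-1]})^\top U h^{[t]} \tag{44}$$

$$= (r^{[s-1]})^\top U h^{[s]} + Q_{s+1} \tag{45}$$

for $s \in [1, T]$, where $Q_{T+1} \equiv 0$, so that $Q \equiv Q_1$ is the last term of (43) up to its sign (the last term of (43) equals $-Q$). Taking the partial derivative with respect to $r_i^{[s-1]}$, we obtain

$$\frac{\partial Q_s}{\partial r_i^{[s-1]}} = \frac{\partial}{\partial r_i^{[s-1]}} (r^{[s-1]})^\top U h^{[s]} + \sum_j \frac{\partial r_j^{[s]}}{\partial r_i^{[s-1]}} \frac{\partial Q_{s+1}}{\partial r_j^{[s]}} \tag{46}$$

$$= U_{i,:}\, h^{[s]} + \sum_j r_j^{[s]} (1 - r_j^{[s]})\, u_{i,j}\, \frac{\partial Q_{s+1}}{\partial r_j^{[s]}}, \tag{47}$$

where $U_{i,:}$ denotes the $i$-th row of $U$, $u_{i,j}$ denotes the $(i,j)$-th element of $U$, and the last equality follows from (40) and (42). In vector-matrix notations, we can write

$$\nabla_{r^{[s-1]}} Q_s = U \big(h^{[s]} + r^{[s]} \odot (1 - r^{[s]}) \odot \nabla_{r^{[s]}} Q_{s+1}\big), \tag{48}$$

where $\odot$ denotes elementwise multiplication. Because $Q_s$ is not a function of $r^{[0]}, \ldots, r^{[s-2]}$, we have

$$\nabla_{r^{[s-1]}} Q = \nabla_{r^{[s-1]}} Q_s. \tag{49}$$

Therefore, the partial derivative of $Q$ with respect to $r^{[s-1]}$ is given recursively as follows:

$$\nabla_{r^{[s-1]}} Q = U \big(h^{[s]} + r^{[s]} \odot (1 - r^{[s]}) \odot \nabla_{r^{[s]}} Q\big) \tag{50}$$

for $s = 1, \ldots, T$ and

$$\nabla_{r^{[T]}} Q = 0. \tag{51}$$

We now take the derivative of $Q$ with respect to the parameters in $\theta$, starting with $U$:

$$\frac{dQ}{du_{i,j}} = \sum_{t=0}^{T} \sum_k \frac{\partial Q}{\partial r_k^{[t]}} \frac{\partial r_k^{[t]}}{\partial u_{i,j}} + \frac{\partial Q}{\partial u_{i,j}} \tag{52}$$

$$= \sum_{t=1}^{T} \frac{\partial Q}{\partial r_j^{[t]}}\, r_j^{[t]} (1 - r_j^{[t]})\, r_i^{[t-1]} + \sum_{t=1}^{T} r_i^{[t-1]} h_j^{[t]}, \tag{53}$$

where the last equality follows from (40) and (42). In vector-matrix notations, we can write

$$\nabla_U Q = \sum_{t=1}^{T} r^{[t-1]} \big(r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q + h^{[t]}\big)^\top, \tag{54}$$

where $\nabla_{r^{[t]}} Q$ is given by (50) and (51).

The gradient of $Q$ with respect to other parameters can be derived as follows:

$$\nabla_W Q = \sum_{t=0}^{T} x^{[t]} \big(r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q\big)^\top \tag{55}$$

$$\nabla_{b^{\rm H}} Q = \sum_{t=1}^{T} r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \tag{56}$$

$$\nabla_{b^{\rm init}} Q = r^{[0]} \odot (1 - r^{[0]}) \odot \nabla_{r^{[0]}} Q \tag{57}$$

$$\nabla_{b^{\rm V}} Q = 0. \tag{58}$$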
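The backward recursion for the derivative of $Q$ with respect to the expected hidden values can be checked against finite differences. The sketch below is an illustration only: it assumes that $r^{[t]}$ is the logistic function of $b^{\rm H} + U^\top r^{[t-1]} + W^\top x^{[t]}$ and that $Q$ is the sum of $(r^{[t-1]})^\top U h^{[t]}$ over $t = 1, \ldots, T$, with fixed hidden samples `hs`. All function and variable names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(r0, xs, b_h, U, W):
    """Return [r[0], ..., r[T]] given r[0]; xs[0] is not used here."""
    rs = [list(r0)]
    n = len(b_h)
    for x in xs[1:]:
        prev = rs[-1]
        rs.append([sigmoid(b_h[j]
                           + sum(U[i][j] * prev[i] for i in range(n))
                           + sum(W[v][j] * x[v] for v in range(len(x))))
                   for j in range(n)])
    return rs


def q_value(rs, hs, U):
    """Q = sum_{t=1}^{T} r[t-1]^T U h[t]."""
    n = len(U)
    return sum(rs[t - 1][i] * U[i][j] * hs[t][j]
               for t in range(1, len(rs)) for i in range(n) for j in range(n))


def grad_q_wrt_r(rs, hs, U):
    """Backward recursion: g[T] = 0 and g[s-1] = U (h[s] + r[s](1 - r[s]) g[s])."""
    n = len(U)
    T = len(rs) - 1
    g = [[0.0] * n for _ in range(T + 1)]
    for s in range(T, 0, -1):
        inner = [hs[s][j] + rs[s][j] * (1.0 - rs[s][j]) * g[s][j]
                 for j in range(n)]
        g[s - 1] = [sum(U[i][j] * inner[j] for j in range(n)) for i in range(n)]
    return g


# Arbitrary illustrative values (two hidden units, two visible units, T = 3).
b_h = [0.1, -0.2]
U = [[0.3, -0.5], [0.8, 0.2]]
W = [[0.4, 0.1], [-0.3, 0.6]]
xs = [[1, 0], [0, 1], [1, 1], [0, 0]]
hs = [[0, 1], [1, 0], [1, 1], [0, 1]]
r0 = [0.3, 0.7]

rs = forward(r0, xs, b_h, U, W)
g = grad_q_wrt_r(rs, hs, U)
```

The recursion visits each time step once, going backward from $T$, which is why the cost of one gradient evaluation grows linearly with the length of the time-series.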

The gradients of $Q$ can be used to show the following gradients of the energy:

$$\nabla_U E_\theta(x, h) = -\sum_{t=1}^{T} r^{[t-1]} \big(r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q + h^{[t]}\big)^\top \tag{59}$$

$$\nabla_W E_\theta(x, h) = -\sum_{t=0}^{T} x^{[t]} (h^{[t]})^\top - \sum_{t=0}^{T} x^{[t]} \big(r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q\big)^\top \tag{60}$$

$$\nabla_{b^{\rm H}} E_\theta(x, h) = -\sum_{t=1}^{T} h^{[t]} - \sum_{t=1}^{T} r^{[t]} \odot (1 - r^{[t]}) \odot \nabla_{r^{[t]}} Q \tag{61}$$

$$\nabla_{b^{\rm init}} E_\theta(x, h) = -h^{[0]} - r^{[0]} \odot (1 - r^{[0]}) \odot \nabla_{r^{[0]}} Q \tag{62}$$

$$\nabla_{b^{\rm V}} E_\theta(x, h) = -\sum_{t=0}^{T} x^{[t]}, \tag{63}$$

where $\nabla_{r^{[t]}} Q$ is given by (50) and (51). The gradient of the log-likelihood of the given time-series $x$ now follows from (72) in [28].

4.3.3 Extensions

The RTRBM has been extended in various ways. Mittelman et al. study a structured RTRBM, where units are partitioned into blocks, and only the connections between particular blocks are allowed [25]. Lyu et al. replace the RNN of the RTRBM with one that uses Long Short-Term Memory (LSTM) [22]. Schrauwen and Buesing replace the RNN of the RTRBM with an echo state network [36]. An RNN-RBM slightly generalizes the RTRBM by relaxing the constraint of the RTRBM that $r^{[t]}$ must be the expected value of $h^{[t]}$ [7]. Namely, an RNN-RBM is an RNN but gives an RBM as an output, where the RNN and the RBM do not share parameters, while an RTRBM shares parameters between an RNN and an RBM.

5 Dynamic Boltzmann machines

BPTT is not desirable for online learning, where we update $\theta$ every time a new pattern $x^{[t]}$ is observed. The per-step computational complexity of BPTT in online learning grows linearly with the length of the preceding time-series. Such online learning, however, is needed, for example, when we cannot store all observed patterns in memory or when we want to adapt to a changing environment.

The dynamic Boltzmann machine (DyBM) is proposed as a time-series model that allows efficient online learning [31, 32]. The per-step computational complexity of the learning rule of a DyBM is independent of the length of the preceding time-series. In Section 5.1, we start by reviewing the DyBM introduced in [31, 32] and the relation of its learning rule to spike-timing dependent plasticity (STDP).
In Section 5.2, we study the relaxation of some of the constraints that the DyBM has required in [31, 32], in a way that it becomes more suitable for inference and learning [27]. The primary purpose of these constraints in [31, 32] was to mimic a particular form of STDP. The relaxed DyBM generalizes the original DyBM and allows us to interpret it as a form of logistic regression for time-series data.

In Section 5.3, we review DyBMs dealing with real-valued time-series [27, 8]. These DyBMs extend the binary DyBM in the same way that Gaussian Boltzmann machines [23, 43, 13] extend Boltzmann machines [2, 14] from binary values to real-valued patterns. The Gaussian DyBM can be related to a vector autoregressive (VAR) model [21]. Specifically, we show that a special case of the Gaussian DyBM is a VAR model having additional variables that capture long term dependency of time-series. These additional variables correspond to the DyBM's eligibility traces, which represent how recently and frequently spikes arrived from one neuron to another. We also review an extension of the Gaussian DyBM to deal with time-series patterns in continuous space [17].

Figure 6: A dynamic Boltzmann machine unfolded through time (Figure 1c from [31]).

Some of the models and learning algorithms in this section have been implemented in Python or Java and open-sourced.

5.1 Dynamic Boltzmann machines for binary-valued time-series

5.1.1 Finite dynamic Boltzmann machines (footnote 3)

The DyBM in [31, 32] is defined as a limit of a sequence of Boltzmann machines (DyBM-$T$) consisting of $T$ layers as $T$ tends to infinity (see Figure 6). Formally, the DyBM-$T$ is defined as the CRBM (see Section 3.1) having $T$ layers of $N$ visible units ($T-1$ layers of input units and one layer of output units) and no hidden units, so that its conditional energy is defined as

$$E_\theta(x^{[t]} \mid x^{[t-T+1,t-1]}) = -b^\top x^{[t]} - \sum_{\delta=1}^{T-1} (x^{[t-\delta]})^\top W^{[\delta]} x^{[t]}, \tag{64}$$

where the weight of the DyBM-$T$, $W^{[1]}, \ldots, W^{[T-1]}$, assumes a particular parametric form with a finite number of parameters that are independent of $T$, which we discuss in the following.

The parametric form of the weight in the DyBM-$T$ is motivated by observations from biological neural networks [1] but leads to a particularly simple, exact, and efficient learning rule. In biological neural networks, STDP has been postulated and supported experimentally. In particular, the weight from a pre-synaptic neuron to a post-synaptic neuron is strengthened if the post-synaptic neuron fires (generates a spike) shortly after the pre-synaptic neuron fires (i.e., long term potentiation or LTP). This weight is weakened if the post-synaptic neuron fires shortly before the pre-synaptic neuron fires (i.e., long term depression or LTD). This dependency on the timing of spikes is missing in the Hebbian rule for the Boltzmann machine (see (36) from [28]).

To have a learning rule with the characteristics of STDP with LTP and LTD, the DyBM-$T$ assumes the weight of the form illustrated in Figure 7. For $\delta > 0$, we define the weight, $w_{i,j}^{[\delta]}$, as the sum of two weights, $\hat w_{i,j}^{[\delta]}$ and $\hat w_{j,i}^{[-\delta]}$:

$$w_{i,j}^{[\delta]} = \hat w_{i,j}^{[\delta]} + \hat w_{j,i}^{[-\delta]}, \tag{65}$$

Footnote 3. This section closely follows [31].

Figure 7: The figure illustrates Equation (65) with particular forms of Equation (66) (Figure 2 from [31]). The horizontal axis represents $\delta$, and the vertical axis represents the value of $w_{i,j}^{[\delta]}$ (solid curves), $\hat w_{i,j}^{[\delta]}$ (dashed curves), or $\hat w_{j,i}^{[-\delta]}$ (dotted curves). Notice that $w_{i,j}^{[\delta]}$ is defined for $\delta > 0$ and is discontinuous at $\delta = d$. On the other hand, $\hat w_{i,j}^{[\delta]}$ and $\hat w_{j,i}^{[-\delta]}$ are defined for $-\infty < \delta < \infty$ and discontinuous at $\delta = d_{i,j}$ and $\delta = -d_{j,i}$, respectively, where recall that we assume $d_{i,j} = d_{j,i} = d$ in this paper.

where

$$\hat w_{i,j}^{[\delta]} = \begin{cases} 0 & \text{if } \delta = 0 \\ u_{i,j}\, \lambda^{\delta - d} & \text{if } \delta \ge d \\ -v_{i,j}\, \mu^{-\delta} & \text{otherwise} \end{cases} \tag{66}$$

for $\lambda, \mu \in [0, 1)$. For simplicity, we assume a single decay rate $\lambda$ for $\delta \ge d$ and a single decay rate $\mu$ for $\delta < d$, as opposed to multiple ones in [31, 32]. For simplicity, we also assume that the conduction delay $d$ is uniform for all connections, as opposed to the variable conduction delay in [32]. See also [9, 29] for ways to tune the values of the conduction delay.

In Figure 7, the value of $\hat w_{i,j}^{[\delta]}$ is high when $\delta = d$, the conduction delay from the $i$-th (pre-synaptic) unit to the $j$-th (post-synaptic) unit. Namely, the post-synaptic neuron is likely to fire (i.e., $x_j^{[0]} = 1$) immediately after the spike from the pre-synaptic unit arrives with the delay of $d$ (i.e., $x_i^{[-d]} = 1$). This likelihood is controlled by the LTP weight $u_{i,j}$. The value of $\hat w_{i,j}^{[\delta]}$ gradually decreases as $\delta$ increases from $d$. That is, the effect of the stimulus of the spike that arrived from the $i$-th unit diminishes with time [1].

The value of $\hat w_{i,j}^{[d-1]}$ is low, suggesting that the post-synaptic unit is unlikely to fire (i.e., $x_j^{[0]} = 1$) immediately before the spike from the $i$-th (pre-synaptic) unit arrives. This unlikelihood is controlled by the LTD weight $v_{i,j}$. As $\delta$ decreases from $d - 1$, the magnitude of $\hat w_{i,j}^{[\delta]}$ gradually decreases [1]. Here, $\delta$ can get smaller than 0, and $\hat w_{i,j}^{[\delta]}$ with $\delta < 0$ represents the weight for a spike of the pre-synaptic neuron that is generated after the spike of the post-synaptic neuron.
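The parametric form of the weight can be written down directly. The following is a minimal sketch for a single pair of units, assuming the form $\hat w^{[\delta]} = 0$ for $\delta = 0$, $u \lambda^{\delta - d}$ for $\delta \ge d$, and $-v \mu^{-\delta}$ otherwise, with a uniform delay $d$; the function names `w_hat` and `w` and the numerical values are hypothetical.

```python
def w_hat(delta, u, v, lam, mu, d):
    """hat-w^{[delta]} for one ordered pair of units, following (66)."""
    if delta == 0:
        return 0.0
    if delta >= d:
        return u * lam ** (delta - d)   # LTP branch, decays as delta grows
    return -v * mu ** (-delta)          # LTD branch, decays as delta shrinks


def w(delta, u_ij, v_ij, v_ji, lam, mu, d):
    """w_{i,j}^{[delta]} = hat-w_{i,j}^{[delta]} + hat-w_{j,i}^{[-delta]} for delta > 0.

    Since -delta < 0 < d, the second term always uses the LTD branch, so the
    LTP weight u_{j,i} never contributes and is passed as 0.0 here.
    """
    if delta <= 0:
        raise ValueError("w is defined only for delta > 0")
    return (w_hat(delta, u_ij, v_ij, lam, mu, d)
            + w_hat(-delta, 0.0, v_ji, lam, mu, d))


# Illustrative parameters: delay d = 3, decay rates 0.5, u = 1, v = 2.
d, lam, mu = 3, 0.5, 0.5
at_delay = w_hat(d, 1.0, 2.0, lam, mu, d)         # LTP peak at delta = d
after_delay = w_hat(d + 1, 1.0, 2.0, lam, mu, d)  # decayed LTP
before_delay = w_hat(d - 1, 1.0, 2.0, lam, mu, d) # LTD just before arrival
```

With these values the weight jumps from a negative LTD value at $\delta = d - 1$ to the positive LTP weight at $\delta = d$, which is the discontinuity noted in the caption of Figure 7.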

Figure 8: A connection from a pre-synaptic neuron $i$ to a post-synaptic neuron $j$ in a DyBM. The pre-synaptic neuron stores a neural eligibility trace $\gamma$; its spikes $x^{[t-1]}, x^{[t-2]}, \ldots, x^{[t-d+1]}$ travel along a FIFO queue to the synapse, which stores a synaptic eligibility trace $\alpha$.

5.1.2 Dynamic Boltzmann machine as a limit of a sequence of finite dynamic Boltzmann machines (footnote 4)

The DyBM is defined as a limit of the sequence of DyBM-$T$ as $T \to \infty$. Because each DyBM-$T$ is a CRBM, we can also define the limit of the sequence of the conditional probabilities defined by the DyBM-$T$, and this limit is considered as the conditional probability defined by the DyBM. Likewise, the conditional energy of the DyBM is defined as the limit of the sequence of the conditional energies of the DyBM-$T$. Specifically, as $T \to \infty$, the conditional energy of the DyBM-$T$ in (64) converges to

$$E_\theta(x^{[t]} \mid x^{[<t]}) = -b^\top x^{[t]} - \sum_{\delta=1}^{\infty} (x^{[t-\delta]})^\top W^{[\delta]} x^{[t]}, \tag{67}$$

where the convergence is due to the parametric form (66). This conditional energy in turn defines the conditional distribution via (9), where we now have no hidden units.

Although the conditional energy (67) of the DyBM involves an infinite sum, it can be evaluated with a finite sum because of the parametric form (66). In fact, the DyBM can be understood as an artificial model of a spiking neural network where all computation for inference and learning is performed locally at each synapse using only the information available around the synapse.

Specifically, a pre-synaptic neuron is connected to a post-synaptic neuron via a first-in-first-out (FIFO) queue and a synapse (see Figure 8). At each discrete time $t$, a neuron $i$ either fires ($x_i^{[t]} = 1$) or not ($x_i^{[t]} = 0$). The spike travels along the FIFO queue and reaches the synapse after the conduction delay, $d$. In other words, the FIFO queue has the length of $d - 1$ and stores, at time $t$, the spikes that have been generated by the pre-synaptic neuron from time $t - d + 1$ to time $t - 1$.

Each synapse in a DyBM stores a quantity called a synaptic eligibility trace. The value of the synaptic eligibility trace increases when a spike arrives at the synapse from the FIFO queue; otherwise, it is decreased by a constant factor.
Specifically, at time $t$, the value of the synaptic eligibility trace, $\alpha_i^{[t]}$, that is stored at the synapse from a pre-synaptic neuron $i$ is updated as follows:

$$\alpha_i^{[t]} = \lambda\, \alpha_i^{[t-1]} + x_i^{[t-d+1]}, \tag{68}$$

where $\lambda$ is a decay rate and satisfies $0 \le \lambda < 1$. Figure 9 shows an example of how the value of the synaptic eligibility trace changes depending on the spikes that arrived at the synapse. Observe that $\alpha_i^{[t]}$ represents how recently and frequently spikes arrived from a pre-synaptic neuron $i$ and can be represented non-recursively as follows:

$$\alpha_i^{[t-1]} = \sum_{s=-\infty}^{t-d} \lambda^{t-s-d}\, x_i^{[s]}. \tag{69}$$
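The recursive update of the synaptic eligibility trace and its non-recursive form can be compared directly. The sketch below assumes a single pre-synaptic neuron, no spikes before time 0, and a conduction delay $d \ge 2$ so that the FIFO queue is non-empty; the function names and the spike train are hypothetical.

```python
from collections import deque


def run_synaptic_trace(spikes, lam, d):
    """Return alpha[t] for t = 0, ..., len(spikes) - 1 using the recursion (68).

    spikes[t] is x[t] of the pre-synaptic neuron. The FIFO queue holds the
    d - 1 most recent spikes x[t-d+1], ..., x[t-1]; spikes before time 0 are
    assumed to be zero. Requires d >= 2.
    """
    fifo = deque([0] * (d - 1), maxlen=d - 1)
    alpha = 0.0
    trace = []
    for x_t in spikes:
        arriving = fifo[0]          # x[t-d+1] reaches the synapse at time t
        alpha = lam * alpha + arriving
        trace.append(alpha)
        fifo.append(x_t)            # x[t] enters; the oldest spike is dropped
    return trace


def closed_form_trace(spikes, lam, d, t):
    """alpha[t] from the non-recursive form (69), with zero history before time 0."""
    return sum(lam ** (t + 1 - s - d) * spikes[s] for s in range(0, t - d + 2))


spikes = [1, 0, 1, 1, 0, 0, 1]
trace = run_synaptic_trace(spikes, lam=0.5, d=3)
```

The recursion touches only the current trace value and the head of the queue, so the cost per step does not depend on how long the time-series has been running.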

Figure 9: The value of a synaptic or neural eligibility trace as a function of time (horizontal axis: time in DyBM steps; vertical axis: eligibility and spike values). For a synaptic eligibility trace at a synapse, the bars represent the spikes that arrived from a FIFO queue at that synapse. For a neural eligibility trace at a neuron, the bars represent the spikes generated by that neuron.

Each neuron in a DyBM stores a quantity called a neural eligibility trace (footnote 5). The value of the neural eligibility trace increases when the neuron fires; otherwise, it is decreased by a constant factor. Specifically, at time $t$, the value of the neural eligibility trace, $\gamma_i^{[t]}$, at a neuron $i$ is updated as follows:

$$\gamma_i^{[t]} = \mu\, (\gamma_i^{[t-1]} + x_i^{[t]}), \tag{70}$$

where $\mu$ is a decay rate and satisfies $0 \le \mu < 1$. Observe that $\gamma_i^{[t]}$ represents how recently and frequently the neuron $i$ has fired and can be represented non-recursively as follows:

$$\gamma_i^{[t-1]} = \sum_{s=-\infty}^{t-1} \mu^{t-s}\, x_i^{[s]}. \tag{71}$$

A neuron in a DyBM fires according to the probability distribution that depends on the energy of the DyBM. A neuron is more likely to fire when firing makes the energy lower than not firing. Let $E_{\theta,j}(x_j^{[t]} \mid x^{[<t]})$ be the energy associated with a neuron $j$ at time $t$, which can depend on whether $j$ fires at time $t$ (i.e., $x_j^{[t]}$) as well as the preceding spiking activities of the neurons in the DyBM (i.e., $x^{[<t]}$). The firing probability of a neuron $j$ is then given by

$$P_{\theta,j}(x_j^{[t]} \mid x^{[<t]}) = \frac{\exp(-E_{\theta,j}(x_j^{[t]} \mid x^{[<t]}))}{\sum_{\tilde x \in \{0,1\}} \exp(-E_{\theta,j}(\tilde x \mid x^{[<t]}))} \tag{72}$$

for $x_j^{[t]} \in \{0, 1\}$. Specifically, $E_{\theta,j}(x_j^{[t]} \mid x^{[<t]})$ can be represented as follows:

$$E_{\theta,j}(x_j^{[t]} \mid x^{[<t]}) = -b_j\, x_j^{[t]} + E_{\theta,j}^{\rm LTP}(x_j^{[t]} \mid x^{[<t]}) + E_{\theta,j}^{\rm LTD}(x_j^{[t]} \mid x^{[<t]}), \tag{73}$$

where $b_j$ is the bias parameter of a neuron $j$ and represents how likely $j$ is to spike ($j$ is more likely to fire if $b_j$ has a large positive value), and we define

$$E_{\theta,j}^{\rm LTP}(x_j^{[t]} \mid x^{[<t]}) \equiv -\sum_{i=1}^{N} u_{i,j}\, \alpha_i^{[t-1]}\, x_j^{[t]} \tag{74}$$

$$E_{\theta,j}^{\rm LTD}(x_j^{[t]} \mid x^{[<t]}) \equiv \sum_{i=1}^{N} v_{i,j}\, \beta_i^{[t-1]}\, x_j^{[t]} + \sum_{k=1}^{N} v_{j,k}\, \gamma_k^{[t-1]}\, x_j^{[t]}, \tag{75}$$

where $\beta_i^{[t-1]}$ represents how soon and frequently spikes will arrive at the synapse from the FIFO queue from $i$ to $j$:

$$\beta_i^{[t-1]} \equiv \sum_{s=t-d+1}^{t-1} \mu^{s-t}\, x_i^{[s]}. \tag{76}$$

Although $\beta_i^{[t-1]}$ can also be represented in a recursive manner, recursively computed $\beta_i^{[t-1]}$ is prone to numerical instability.

In (74), the summation with respect to $i$ is over all of the pre-synaptic neurons that are connected to $j$. Here, $u_{i,j}$ is the weight parameter from $i$ to $j$ and represents the strength of Long Term Potentiation (LTP). This weight parameter is thus referred to as the LTP weight. A neuron $j$ is more likely to fire ($x_j^{[t]} = 1$) when $\alpha_i^{[t-1]}$ is large for a pre-synaptic neuron $i$ connected to $j$ (spikes have recently arrived at $j$ from $i$) and the corresponding $u_{i,j}$ is positive and large (LTP from $i$ to $j$ is strong).

In (75), the summation with respect to $i$ is over all of the pre-synaptic neurons that are connected to $j$, and the summation with respect to $k$ is over all of the post-synaptic neurons to which $j$ is connected. Here, $v_{i,j}$ represents the strength of Long Term Depression (LTD) from $i$ to $j$ and is referred to as the LTD weight. The neuron $j$ is less likely to fire when $\beta_i^{[t-1]}$ is large for a pre-synaptic neuron $i$ connected to $j$ (spikes will soon and frequently reach $j$ from $i$) and the corresponding $v_{i,j}$ is positive and large (LTD from $i$ to $j$ is strong). The second term in (75) represents that a pre-synaptic neuron $j$ is less likely to fire if a post-synaptic neuron $k$ has recently and frequently fired ($\gamma_k^{[t-1]}$ is large), and the strength of this LTD is given by $v_{j,k}$. Notice that the timing of a spike is measured with respect to when the spike reaches the synapse, where the spike from a pre-synaptic neuron has the delay $d$, and the spike from a post-synaptic neuron reaches it immediately.

The learning rule of the DyBM has been derived in a way that it maximizes the log-likelihood of a given time-series with respect to the probability distribution given by (72) [32].

Footnote 4. This section closely follows [27].
Footnote 5. We assume a single neural eligibility trace, as opposed to multiple ones in [32], at each neuron.
Specfcally, at tme t, the DyBM updates ts parameters accordng to b b + η x [t] u, u, + η α [t 1] v, v, + η β [t 1] E θ, [X [t] x [<t] ] 77 [t] x E θ, [X [t] x [<t] ] 78 Eθ, [X [t] x [<t] ] x [t] [t 1] [t] + η γ X x [t] 79 for each of neurons and, where η s a learnng rate, x [t] s the tranng data gven to at tme t, and E θ, [X [t] x [<t] ] denotes the expected value of x [t].e., frng probablty of a neuron at tme t accordng to the probablty dstrbuton gven by 72. By followng stochastc gradent methods [6, 18, 10, 41, 33], the learnng rate η may be adusted over tme t Relaton to spke-tmng dependent plastcty 6 In spke-tmng dependent plastcty STDP, the amount of the change n the weght between two neurons that fred together depends on the precse tmngs when the two neurons fred. STDP supplements the Hebban rule [11] and has been expermentally confrmed n bologcal neural networks [5]. 6 Ths secton closely follows [27]. 17
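The per-neuron quantities used in (70)-(79) can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the implementation of [32]: the function names are hypothetical, the eligibility traces $\alpha$, $\beta$, and $\gamma$ are passed in as already-computed numbers, and the recursion for $\gamma$ is read as $\mu(\gamma + x)$, which is the grouping that agrees with the non-recursive form (71).

```python
import math


def update_gamma(gamma_prev, x_j, mu):
    """Neural eligibility trace, read as gamma[t] = mu * (gamma[t-1] + x_j[t]).

    Assumption: this grouping is chosen because it agrees with the
    non-recursive form (71), gamma[t-1] = sum_{s<=t-1} mu**(t-s) * x_j[s].
    """
    return mu * (gamma_prev + x_j)


def gamma_closed_form(spikes, mu):
    """Trace after the last entry of `spikes`, assuming no earlier spikes."""
    last = len(spikes) - 1
    return sum(mu ** (last + 1 - s) * x for s, x in enumerate(spikes))


def firing_probability(b_j, ltp_terms, ltd_terms):
    """P(x_j[t] = 1 | past) from (72)-(75).

    ltp_terms: (u_ij, alpha_ij) pairs, one per pre-synaptic neuron i.
    ltd_terms: (weight, trace) pairs covering both sums of (75),
               i.e. (v_ij, beta_ij) and (v_jk, gamma_k).
    """
    # Energy when j fires; the energy when j stays silent (x_j = 0) is 0.
    energy_fire = -b_j
    energy_fire -= sum(u * alpha for u, alpha in ltp_terms)
    energy_fire += sum(v * trace for v, trace in ltd_terms)
    return 1.0 / (1.0 + math.exp(energy_fire))


def update_bias(b_j, x_j, p_j, eta):
    """Bias update (77); p_j is the firing probability of neuron j."""
    return b_j + eta * (x_j - p_j)


def update_synapse(u_ij, v_ij, alpha_ij, beta_ij, gamma_j,
                   x_i, x_j, p_i, p_j, eta):
    """LTP update (78) and LTD update (79) for the synapse from i to j."""
    u_new = u_ij + eta * alpha_ij * (x_j - p_j)
    v_new = (v_ij
             + eta * beta_ij * (p_j - x_j)    # positive when j is silent
             + eta * gamma_j * (p_i - x_i))   # positive when i is silent
    return u_new, v_new


if __name__ == "__main__":
    mu, gamma = 0.5, 0.0
    spikes = [1, 0, 1, 1]
    for x in spikes:
        gamma = update_gamma(gamma, x, mu)
    # The recursive and non-recursive computations should agree.
    print(gamma, gamma_closed_form(spikes, mu))
```

Passing the traces as plain numbers keeps the sketch independent of how the FIFO queues are stored; a fuller implementation would maintain $\alpha$ and $\beta$ per synapse.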

In (78), $u_{i,j}$ is increased (LTP gets stronger) when $x_j^{[t]} = 1$ is given to $j$. Then $j$ becomes more likely to fire when spikes from $i$ have recently and frequently arrived at $j$ (i.e., $\alpha_{i,j}^{[\cdot]}$ is large). This amount of the change in $u_{i,j}$ depends on $\alpha_{i,j}^{[t-1]}$, exhibiting a key property of STDP. In particular, $u_{i,j}$ is increased by a large amount if spikes from $i$ have recently and frequently arrived at $j$.

According to the second term on the right-hand side of (79), $v_{i,j}$ is increased (LTD gets stronger) when $x_j^{[t]} = 0$ is given to a post-synaptic neuron $j$. Then $j$ becomes less likely to fire when spikes from $i$ are expected to reach $j$ soon (i.e., $\beta_{i,j}^{[\cdot]}$ is large). This amount of the change in $v_{i,j}$ is large if there are spikes in the FIFO queue from $i$ to $j$ and they are close to $j$. According to the last term of (79), $v_{i,j}$ is increased when $x_i^{[t]} = 0$ is given to the pre-synaptic $i$, and this amount of the change in $v_{i,j}$ is proportional to $\gamma_j^{[t-1]}$ (i.e., how frequently and recently the post-synaptic $j$ has fired). This learning rule of (79) thus exhibits some of the key properties of LTD with STDP.

In (77), $b_j$ is increased when $x_j^{[t]} = 1$ is given to $j$, so that $j$ becomes more likely to fire in accordance with the training data, but the amount of the change in $b_j$ is small if $j$ is already likely to fire ($E_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}] \approx 1$). This dependency on $E_{\theta,j}[X_j^{[t]} \mid \mathbf{x}^{[<t]}]$ can be considered as a form of homeostatic plasticity [42, 20].

Related work. There has been a significant amount of prior work towards understanding STDP from the perspectives of machine learning [26, 4, 35]. For example, Nessler et al. show that STDP can be understood as approximating the expectation maximization (EM) algorithm [26]. Nessler et al. study a particularly structured winner-take-all network and its learning rule for maximizing the log likelihood of given static patterns. On the other hand, the DyBM does not assume particular structures in the network, and the learning rule having the properties of STDP applies to any synapse in the network. Also, the learning rule of the DyBM maximizes the log likelihood of given time-series, and its learning rule does not involve approximations beyond what is assumed in stochastic gradient methods [6].

5.2 Giving flexibility to the DyBM

This section closely follows [27].

It has been shown in [32] that the DyBM in Section 5.1 has the capability of associative memory and anomaly detection for sequential patterns, but the applications of the DyBM have been limited to simple tasks with relatively low dimensional time-series. In [27], we relax some of the constraints of this DyBM in a way that gives more flexibility that is useful for learning and inference. Specifically, observe that the first term on the right-hand side of (75) can be rewritten with the definition of $\beta_{i,j}^{[t-1]}$ in (76) as follows:

$$\sum_{i=1}^{N} v_{i,j} \, \beta_{i,j}^{[t-1]} \, x_j^{[t]} = \sum_{i=1}^{N} \sum_{s=t-d+1}^{t-1} v_{i,j} \, \mu^{s-t} \, x_i^{[s]} \, x_j^{[t]} \tag{80}$$

$$= \sum_{i=1}^{N} \sum_{\delta=1}^{d-1} v_{i,j}^{[\delta]} \, x_i^{[t-\delta]} \, x_j^{[t]}, \tag{81}$$

where we let $v_{i,j}^{[\delta]} \equiv v_{i,j} \, \mu^{-\delta}$. Here, $v_{i,j}^{[\delta]}$ represents how unlikely $j$ fires at time $t$ if $i$ fired at time $t - \delta$. The parametric form of $v_{i,j}^{[\delta]} \equiv v_{i,j} \, \mu^{-\delta}$ assumes that this LTD weight changes geometrically with the interval, $\delta$, between the two spikes. In the following, we relax this constraint on $v_{i,j}^{[\delta]}$ for $\delta = 1, \ldots, d-1$ and assume that these LTD weights can take independent values. Then the energy of the DyBM with $N$ neurons can be represented

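conveniently with matrix and vector operations:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) \equiv \sum_{j=1}^{N} E_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}) \tag{82}$$

$$= -\mathbf{b}^\top \mathbf{x}^{[t]} - (\boldsymbol{\alpha}_\lambda^{[t-1]})^\top \mathbf{U} \, \mathbf{x}^{[t]} + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{V}^{[\delta]} \, \mathbf{x}^{[t]} + (\mathbf{x}^{[t]})^\top \mathbf{V} \, \boldsymbol{\gamma}_\mu^{[t-1]}, \tag{83}$$

where $\mathbf{b} \equiv (b_j)_{j=1,\ldots,N}$ is a vector, $\mathbf{U} \equiv (u_{i,j})_{(i,j) \in \{1,\ldots,N\}^2}$ is a matrix, and other boldface letters are defined analogously (a vector is lowercase and a matrix is uppercase). For eligibility traces $\boldsymbol{\alpha}_\lambda^{[t-1]}$ and $\boldsymbol{\gamma}_\mu^{[t-1]}$, we append the subscript to explicitly represent the dependency on the decay rates $\lambda$ and $\mu$.

The functional form of the energy completely determines the dynamics of a DyBM, and relaxing its constraints allows the DyBM to represent a wider class of dynamical systems. Notice that the last term of (83) can be divided into two terms:

$$(\mathbf{x}^{[t]})^\top \mathbf{V} \, \boldsymbol{\gamma}_\mu^{[t-1]} = (\boldsymbol{\gamma}_\mu^{[t-1]})^\top \mathbf{V}^\top \mathbf{x}^{[t]} \tag{84}$$

$$= \mu^d \, (\boldsymbol{\alpha}_\mu^{[t-1]})^\top \mathbf{V}^\top \mathbf{x}^{[t]} + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \hat{\mathbf{V}}^{[\delta]} \, \mathbf{x}^{[t]}, \tag{85}$$

where $\boldsymbol{\alpha}_\mu^{[t-1]}$ is the same as the vector of synaptic eligibility traces but with the decay rate $\mu$, and $\hat{\mathbf{V}}^{[\delta]} \equiv \mu^{\delta} \, \mathbf{V}^\top$. Comparing (85) and (83), we find that, without loss of generality, the energy of the DyBM can be represented with the following form:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\left( \mathbf{b}^\top + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}^{[\delta]} + \sum_{l=1}^{L} (\boldsymbol{\alpha}_{\lambda_l}^{[t-1]})^\top \mathbf{U}^{[l]} \right) \mathbf{x}^{[t]}, \tag{86}$$

where we define $\mathbf{W}^{[\delta]} = -\mathbf{V}^{[\delta]} - \hat{\mathbf{V}}^{[\delta]}$. The energy in (86) reduces to the original energy in (73) when $\mathbf{W}^{[\delta]} = -\mu^{-\delta} \, \mathbf{V} - \mu^{\delta} \, \mathbf{V}^\top$, $\mathbf{U}^{[1]} = \mathbf{U}$, $\mathbf{U}^{[2]} = -\mu^{d} \, \mathbf{V}^\top$, $\lambda_1 = \lambda$, $\lambda_2 = \mu$, and $L = 2$. With $L > 2$, one can also incorporate multiple synaptic or neural eligibility traces with varying decay rates, as in [32].

Equivalently, we can represent the energy using neural eligibility traces, $\boldsymbol{\gamma}_{\mu_l}$, instead of synaptic eligibility traces, $\boldsymbol{\alpha}_{\lambda_l}$, as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\left( \mathbf{b}^\top + \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}^{[\delta]} + \sum_{l=1}^{L} (\boldsymbol{\gamma}_{\mu_l}^{[t-1]})^\top \mathbf{V}_l \right) \mathbf{x}^{[t]}. \tag{87}$$

Learning rule in vector-matrix notations

The learning rule corresponding to the representation with (86) is as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \left( \mathbf{x}^{[t]} - E_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right) \tag{88}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \left( \mathbf{x}^{[t]} - E_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right)^\top \tag{89}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \left( \mathbf{x}^{[t]} - E_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] \right)^\top \tag{90}$$

for each $\delta$ and each $l$, where $E_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}]$ is the conditional expectation with respect to

$$P_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{\exp\left(-E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]})\right)}{\sum_{\tilde{\mathbf{x}}^{[t]}} \exp\left(-E_\theta(\tilde{\mathbf{x}}^{[t]} \mid \mathbf{x}^{[<t]})\right)}. \tag{91}$$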

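Specifically,

$$E_\theta[\mathbf{X}^{[t]} \mid \mathbf{x}^{[<t]}] = \frac{\exp(\mathbf{m}^{[t]})}{1 + \exp(\mathbf{m}^{[t]})} \tag{92}$$

with

$$\mathbf{m}^{[t]} \equiv \mathbf{b} + \sum_{\delta=1}^{d-1} (\mathbf{W}^{[\delta]})^\top \mathbf{x}^{[t-\delta]} + \sum_{l=1}^{L} (\mathbf{U}^{[l]})^\top \boldsymbol{\alpha}_{\lambda_l}^{[t-1]}, \tag{93}$$

where exponentiation and division of vectors are elementwise. The form of (92) implies that the DyBM is a kind of logit model, where the feature vector, $(\mathbf{x}^{[t-d+1]}, \ldots, \mathbf{x}^{[t-1]}, \boldsymbol{\alpha}_\lambda^{[t-1]}, \boldsymbol{\alpha}_\mu^{[t-1]})$, depends on the prior values, $\mathbf{x}^{[<t]}$, of the time-series. By applying the learning rules above to given time-series, we can learn the parameters of the DyBM or, equivalently, the parameters of the logit model (i.e., $\mathbf{b}$, $\mathbf{W}^{[\delta]}$ for $\delta = 1, \ldots, d-1$, and $\mathbf{U}^{[l]}$ for $l = 1, \ldots, L$).

5.3 Dynamic Boltzmann machines for real-valued time-series

5.3.1 Gaussian dynamic Boltzmann machines

This section closely follows [27].

In this section, we show how a DyBM can deal with real-valued time-series in the form of a Gaussian DyBM [27, 8]. A Gaussian DyBM assumes that $x_j^{[t]}$ follows a Gaussian distribution for each $j$:

$$p_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( -\frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \right), \tag{94}$$

where $m_j^{[t]}$ is given by (93), and $\sigma_j^2$ is a variance parameter. This Gaussian distribution is in contrast to the Bernoulli distribution of the DyBM given by (72).

The conditional energy of the Gaussian DyBM can be represented as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_{j=1}^{N} \frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \tag{95}$$

$$= \sum_{j=1}^{N} \frac{(x_j^{[t]} - b_j)^2}{2 \sigma_j^2} - \sum_{\delta=1}^{d-1} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i^{[t-\delta]} \, \frac{w_{i,j}^{[\delta]}}{\sigma_j^2} \, x_j^{[t]} - \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i,\lambda_l}^{[t-1]} \, \frac{u_{i,j}^{[l]}}{\sigma_j^2} \, x_j^{[t]} + C, \tag{96}$$

where $C$ is the term that does not depend on $\mathbf{x}^{[t]}$. Because $C$ is canceled out between the numerator and the denominator of the conditional probability, we omit it from the conditional energy. By letting $\mathbf{W}_\sigma^{[\delta]}$ be the matrix whose $(i,j)$ element is $w_{i,j}^{[\delta]} / \sigma_j^2$ and $\mathbf{U}_\sigma^{[l]}$ be the matrix whose $(i,j)$ element is $u_{i,j}^{[l]} / \sigma_j^2$, the conditional energy of the Gaussian DyBM can be represented as follows:

$$E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_{j=1}^{N} \frac{(x_j^{[t]} - b_j)^2}{2 \sigma_j^2} - \sum_{\delta=1}^{d-1} (\mathbf{x}^{[t-\delta]})^\top \mathbf{W}_\sigma^{[\delta]} \, \mathbf{x}^{[t]} - \sum_{l=1}^{L} (\boldsymbol{\alpha}_{\lambda_l}^{[t-1]})^\top \mathbf{U}_\sigma^{[l]} \, \mathbf{x}^{[t]}. \tag{97}$$

The conditional energy of the Gaussian DyBM may be compared against the energy of the Gaussian-Bernoulli restricted Boltzmann machine (see [19] or (181) from [28]).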
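To make the role of the constant $C$ concrete, the following minimal sketch evaluates the conditional energy in the squared-error form (95) and in the expanded form (97) and checks that their difference does not depend on $\mathbf{x}^{[t]}$. It assumes a toy model with $N = 2$, $d = 2$ (a single lag), and $L = 1$; all numerical values and function names are made up for illustration.

```python
def mean_93(b, W, U, x_prev, alpha):
    """m_j = b_j + sum_i W[i][j] * x_prev[i] + sum_i U[i][j] * alpha[i]."""
    n = len(b)
    return [b[j]
            + sum(W[i][j] * x_prev[i] for i in range(n))
            + sum(U[i][j] * alpha[i] for i in range(n))
            for j in range(n)]


def energy_95(x, m, sigma2):
    """Squared-error form (95) of the conditional energy."""
    return sum((x[j] - m[j]) ** 2 / (2.0 * sigma2[j]) for j in range(len(x)))


def energy_97(x, b, W, U, x_prev, alpha, sigma2):
    """Expanded form (97); the constant C is omitted."""
    n = len(x)
    quad = sum((x[j] - b[j]) ** 2 / (2.0 * sigma2[j]) for j in range(n))
    lag = sum(x_prev[i] * W[i][j] / sigma2[j] * x[j]
              for i in range(n) for j in range(n))
    trace = sum(alpha[i] * U[i][j] / sigma2[j] * x[j]
                for i in range(n) for j in range(n))
    return quad - lag - trace


# Toy parameters and history (illustrative values only).
B = [0.1, -0.2]
W = [[0.5, -0.3], [0.2, 0.4]]
U = [[0.1, 0.0], [-0.2, 0.3]]
X_PREV = [1.0, -1.0]
ALPHA = [0.5, 2.0]
SIGMA2 = [1.5, 0.5]


def energy_gap(x):
    """Difference (95) - (97), which should equal the constant C."""
    m = mean_93(B, W, U, X_PREV, ALPHA)
    return energy_95(x, m, SIGMA2) - energy_97(x, B, W, U, X_PREV, ALPHA, SIGMA2)


if __name__ == "__main__":
    for x in ([0.0, 0.0], [3.0, -2.0], [-1.5, 0.7]):
        print(x, energy_gap(x))
```

The printed gap is the same for every choice of $\mathbf{x}^{[t]}$, which is why $C$ can be dropped without changing the conditional distribution.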

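We now derive a learning rule for the Gaussian DyBM in a way that it maximizes the log-likelihood of given time-series $\mathbf{x}$:

$$\sum_t \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \sum_t \sum_{j=1}^{N} \log p_{\theta,j}(x_j^{[t]} \mid \mathbf{x}^{[<t]}), \tag{98}$$

where the summation over $t$ is over all of the time steps of $\mathbf{x}$, and the conditional independence between $x_i^{[t]}$ and $x_j^{[t]}$ for given $\mathbf{x}^{[<t]}$ is the fundamental property of the DyBM as shown in [32].

The approach of stochastic gradient is to update the parameters of the Gaussian DyBM at each step, $t$, according to the gradient of the conditional probability density of $\mathbf{x}^{[t]}$:

$$\nabla \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\nabla \sum_{j=1}^{N} \left( \frac{1}{2} \log \sigma_j^2 + \frac{(x_j^{[t]} - m_j^{[t]})^2}{2 \sigma_j^2} \right), \tag{99}$$

where the equality follows from (94). From (99) and (93), we can derive the derivative with respect to each parameter as follows:

$$\frac{\partial}{\partial b_j} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \tag{100}$$

$$\frac{\partial}{\partial u_{i,j}^{[l]}} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \, \alpha_{i,\lambda_l}^{[t-1]} \tag{101}$$

$$\frac{\partial}{\partial w_{i,j}^{[\delta]}} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = \frac{x_j^{[t]} - m_j^{[t]}}{\sigma_j^2} \, x_i^{[t-\delta]} \tag{102}$$

$$\frac{\partial}{\partial \sigma_j} \log p_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}) = -\frac{1}{\sigma_j} + \frac{(x_j^{[t]} - m_j^{[t]})^2}{\sigma_j^3}, \tag{103}$$

where $\delta \in \{1, \ldots, d-1\}$, $l \in \{1, \ldots, L\}$, and $(i, j) \in \{1, \ldots, N\}^2$. These parameters are thus updated with learning rate $\eta$ as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \tag{104}$$

$$\boldsymbol{\sigma} \leftarrow \boldsymbol{\sigma} + \eta \, \frac{(\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^2 - \boldsymbol{\sigma}^2}{\boldsymbol{\sigma}^3} \tag{105}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \left( \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \right)^\top \tag{106}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \left( \frac{\mathbf{x}^{[t]} - \mathbf{m}^{[t]}}{\boldsymbol{\sigma}^2} \right)^\top \tag{107}$$

where division and exponentiation of vectors are elementwise, and $\mathbf{m}^{[t]}$ is given by (93).

The maximum likelihood estimator of $\mathbf{x}^{[t]}$ by the Gaussian DyBM is given by $\mathbf{m}^{[t]}$ in (93). The Gaussian DyBM can thus be understood as a modification to the standard vector autoregressive (VAR) model. Specifically, the last term in the right-hand side of (93) involves eligibility traces, which can be understood as features of historical values, $\mathbf{x}^{[-\infty, t-d]}$, and are added as new variables to the VAR model. Because the value of the eligibility traces can depend on the infinite past, the Gaussian DyBM can take into account the history beyond the lag $d - 1$.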
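As an illustration of how the updates (104)-(107) operate online, the sketch below performs one step for a scalar time-series ($N = 1$) with $d = 2$ and $L = 1$. This is a simplified example under stated assumptions rather than the reference implementation: the function name and dictionary keys are hypothetical, and the eligibility trace is advanced with the recursion $\alpha^{[t]} = \lambda \, \alpha^{[t-1]} + x^{[t-d+1]}$, which is assumed here by analogy with the synaptic eligibility trace.

```python
def gaussian_dybm_step(params, state, x, eta):
    """One online update of a scalar Gaussian DyBM with d = 2 and L = 1.

    params: dict with bias 'b', lag weight 'w', trace weight 'u',
            and standard deviation 'sigma'.
    state:  dict with 'x_prev' = x[t-1], 'alpha' = alpha[t-1],
            and the decay rate 'lam'.
    Returns the prediction m[t] made before x[t] is observed.
    """
    b, w, u, sigma = params["b"], params["w"], params["u"], params["sigma"]
    x_prev, alpha = state["x_prev"], state["alpha"]

    m = b + w * x_prev + u * alpha           # mean, as in (93)
    scaled_err = (x - m) / sigma ** 2

    # All updates use the parameter values from before this step.
    params["b"] = b + eta * scaled_err                        # (104)
    params["sigma"] = sigma + eta * ((x - m) ** 2 - sigma ** 2) / sigma ** 3  # (105)
    params["w"] = w + eta * x_prev * scaled_err               # (106)
    params["u"] = u + eta * alpha * scaled_err                # (107)

    # Assumed trace recursion: alpha[t] = lam * alpha[t-1] + x[t-d+1], d = 2.
    state["alpha"] = state["lam"] * alpha + x_prev
    state["x_prev"] = x
    return m


if __name__ == "__main__":
    params = {"b": 0.0, "w": 0.0, "u": 0.0, "sigma": 1.0}
    state = {"x_prev": 1.0, "alpha": 2.0, "lam": 0.5}
    prediction = gaussian_dybm_step(params, state, x=1.0, eta=0.1)
    print(prediction, params, state)
```

Each step touches only the current observation, the previous value, and the trace, so the per-step cost does not grow with the length of the preceding time-series.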

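5.3.2 Natural gradients

This section closely follows [27].

In this section, we study a learning rule based on natural gradient for the Gaussian DyBM. Consider a stochastic model that gives the probability density of a pattern $\mathbf{x}$ as $p_\theta(\mathbf{x})$. With natural gradients [3], the parameters, $\theta$, of the stochastic model are updated as follows:

$$\theta_{t+1} = \theta_t + \eta \, G^{-1}(\theta_t) \, \nabla \log p_\theta(\mathbf{x}) \tag{108}$$

at each step $t$, where $\eta$ is the learning rate, and $G(\theta)$ denotes the Fisher information matrix:

$$G(\theta) \equiv \int p_\theta(\mathbf{x}) \, \nabla \log p_\theta(\mathbf{x}) \, \nabla \log p_\theta(\mathbf{x})^\top \, d\mathbf{x}. \tag{109}$$

Due to the conditional independence in (98), it suffices to derive a natural gradient for each Gaussian unit. Here, we consider the parametrization with mean $m$ and variance $v \equiv \sigma^2$. The probability density function of a Gaussian distribution is represented with this parametrization as follows:

$$p(x; m, v) = \frac{1}{\sqrt{2 \pi v}} \exp\left( -\frac{(x - m)^2}{2 v} \right). \tag{110}$$

The log likelihood of $x$ is then given by

$$\log p(x; m, v) = -\frac{(x - m)^2}{2 v} - \frac{1}{2} \log v - \frac{1}{2} \log 2\pi. \tag{111}$$

Hence, the gradient and the inverse Fisher information matrix in (108) are given as follows:

$$\nabla \log p_\theta(x) = \begin{pmatrix} \dfrac{x - m}{v} \\[2mm] \dfrac{(x - m)^2}{2 v^2} - \dfrac{1}{2 v} \end{pmatrix} \tag{112}$$

$$G^{-1}(\theta) = \begin{pmatrix} \dfrac{1}{v} & 0 \\ 0 & \dfrac{1}{2 v^2} \end{pmatrix}^{-1} = \begin{pmatrix} v & 0 \\ 0 & 2 v^2 \end{pmatrix}. \tag{113}$$

The parameters $\theta_t \equiv (m_t, v_t)$ are then updated as follows:

$$m_{t+1} = m_t + \eta \, (x - m_t) \tag{114}$$

$$v_{t+1} = v_t + \eta \left( (x - m_t)^2 - v_t \right). \tag{115}$$

In the context of a Gaussian DyBM, the mean is given by (93), where $m_j^{[t]}$ is linear with respect to $b_j$, $w_{i,j}^{[\delta]}$, and $u_{i,j}^{[l]}$. Also, the variance is given by $\sigma_j^2$. Hence, the natural gradient gives the learning rules for these parameters as follows:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]}) \tag{116}$$

$$\boldsymbol{\sigma}^2 \leftarrow \boldsymbol{\sigma}^2 + \eta \left( (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^2 - \boldsymbol{\sigma}^2 \right) \tag{117}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, \mathbf{x}^{[t-\delta]} \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^\top \tag{118}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, \boldsymbol{\alpha}_{\lambda_l}^{[t-1]} \, (\mathbf{x}^{[t]} - \mathbf{m}^{[t]})^\top, \tag{119}$$

where the exponentiation of a vector is elementwise. These rules can be compared against what the standard gradient gives for the same parameters.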
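The step from (112) and (113) to the updates (114) and (115) can be checked numerically for a single Gaussian unit. The sketch below is only an illustration with hypothetical function names: it multiplies the gradient by the diagonal inverse Fisher information matrix and takes one step of size $\eta$, which should coincide with $m + \eta (x - m)$ and $v + \eta ((x - m)^2 - v)$.

```python
def gaussian_score(x, m, v):
    """Gradient (112) of log p(x; m, v) with respect to (m, v)."""
    return ((x - m) / v,
            (x - m) ** 2 / (2.0 * v ** 2) - 1.0 / (2.0 * v))


def inverse_fisher_diag(v):
    """Diagonal entries of the inverse Fisher information matrix (113)."""
    return (v, 2.0 * v ** 2)


def natural_gradient_step(x, m, v, eta):
    """One natural-gradient update of (m, v) following (108)."""
    g_m, g_v = gaussian_score(x, m, v)
    f_m, f_v = inverse_fisher_diag(v)
    return m + eta * f_m * g_m, v + eta * f_v * g_v


if __name__ == "__main__":
    x, m, v, eta = 3.0, 1.0, 2.0, 0.1
    m_new, v_new = natural_gradient_step(x, m, v, eta)
    # Closed forms (114) and (115) for comparison.
    print(m_new, m + eta * (x - m))
    print(v_new, v + eta * ((x - m) ** 2 - v))
```

Unlike the standard-gradient updates, the resulting step sizes do not carry a factor of $1/\sigma^2$, so the scale of the update does not depend on the current variance estimate.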

5.3.3 Using nonlinear features in Gaussian DyBMs

The Gaussian DyBM is a linear model and has limited capability in modeling complex time-series. A way to take into account nonlinear features of time-series with a Gaussian DyBM is to apply a nonlinear mapping to the input time-series and feed the resulting nonlinear features as additional input to the Gaussian DyBM.

An example of such a nonlinear mapping is an echo state network (ESN) [16]. An ESN maps an input sequence, $\mathbf{x}$, into $\boldsymbol{\psi}$ recursively as follows:

$$\boldsymbol{\psi}^{[t]} = (1 - \rho) \, \boldsymbol{\psi}^{[t-1]} + \rho \, \tanh\left( \mathbf{W}_{\mathrm{rec}} \, \boldsymbol{\psi}^{[t-1]} + \mathbf{W}_{\mathrm{in}} \, \mathbf{x}^{[t]} \right), \tag{120}$$

where $\mathbf{W}_{\mathrm{rec}}$ and $\mathbf{W}_{\mathrm{in}}$ are randomly chosen and fixed parameters (the spectral radius of $\mathbf{W}_{\mathrm{rec}}$ is set smaller than 1), and $\rho$ is a leak parameter satisfying $0 < \rho < 1$. In (120), $\tanh$ is a hyperbolic tangent function but may be replaced with other nonlinear functions such as a sigmoid function. An eligibility trace may be considered as a linear counterpart of the nonlinear features created by an ESN. Because these features are generated by mappings with fixed parameters and just given as input to a Gaussian DyBM, the learning rules for the Gaussian DyBM stay unchanged. The nonlinear DyBM in [8] uses an ESN in a slightly different manner.

5.4 Functional dynamic Boltzmann machines

We now review a functional DyBM, which models time-series of functions (patterns over a continuous space $Z$) [17]. Recall that a Gaussian DyBM defines the conditional distribution of the next real-valued vector given the preceding sequence of real-valued vectors. A functional DyBM defines the conditional distribution of the next function (i.e., $g^{[t]}(\cdot)$) given the preceding sequence of partial observations of preceding functions. At each time $s$, a set of points $Z^{[s]} \equiv (z_i^{[s]})_{i=1,\ldots,N_s}$ is observed, where $N_s$ is the number of points that are observed at $s$.

The functional DyBM assumes that the conditional distribution of $g^{[t]}(\cdot)$ is given by a Gaussian process, whose mean $\mu^{[t]}(\cdot)$ varies over time depending on preceding functions as follows:

$$\mu^{[t]}(z) = b(z) + \sum_{\delta=1}^{d-1} \int_Z w^{[\delta]}(z, z') \, g^{[t-\delta]}(z') \, dz' + \sum_{l=1}^{L} \int_Z u_l(z, z') \, \alpha_l^{[t-1]}(z') \, dz' \tag{121}$$

for $z \in Z$, where $b(\cdot)$ is a functional bias, $w^{[\delta]}(\cdot, \cdot)$ and $u_l(\cdot, \cdot)$ are functional weights for each $\delta$ and for each $l$, and

$$\alpha_l^{[t-1]}(\cdot) \equiv \sum_{s=-\infty}^{t-d} \lambda_l^{t-s-d} \, g^{[s]}(\cdot) \tag{122}$$

is a functional eligibility trace for each $l$. The covariance $k_{\sigma^2}(\cdot, \cdot)$ of the Gaussian process consists of two components such that

$$k_{\sigma^2}(z, z') = k(z, z') + \sigma^2 \, \delta(z, z'), \tag{123}$$

where $k(\cdot, \cdot)$ is an arbitrary kernel, $\delta(\cdot, \cdot)$ is a delta function, and $\sigma$ is a hyperparameter.

For tractability, Kajino proposes a particular parametrization for the functional bias and functional weights [17]. Let $P = (p_1, \ldots, p_M)$ be a set of arbitrarily selected $M$ points in $Z$:

$$b(z) = k_{\sigma^2}(z, P) \, \mathbf{b} \tag{124}$$

$$w^{[\delta]}(z, z') = k_{\sigma^2}(z, P) \, \mathbf{W}^{[\delta]} \, k_{\sigma^2}(P, z') \tag{125}$$

$$u_l(z, z') = k_{\sigma^2}(z, P) \, \mathbf{U}^{[l]} \, k_{\sigma^2}(P, z') \tag{126}$$

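for each $\delta$ and each $l$, where

$$k_{\sigma^2}(z, P) \equiv \left( k_{\sigma^2}(z, p_i) \right)_{i=1,\ldots,M} \tag{127}$$

is a row vector, and

$$k_{\sigma^2}(P, z) \equiv \left( k_{\sigma^2}(p_i, z) \right)_{i=1,\ldots,M} \tag{128}$$

is a column vector. Because $g^{[t]}(\cdot)$ is in the reproducing kernel Hilbert space with kernel $k_{\sigma^2}$, substituting this parametrization into (121) gives the following expression:

$$\mu_\theta^{[t]}(z) = k_{\sigma^2}(z, P) \left( \mathbf{b} + \sum_{\delta=1}^{d-1} \mathbf{W}^{[\delta]} \, g^{[t-\delta]}(P) + \sum_{l=1}^{L} \mathbf{U}^{[l]} \, \alpha_l^{[t-1]}(P) \right), \tag{129}$$

where $g^{[t-\delta]}(P)$ is a column vector with $i$-th element being $g^{[t-\delta]}(p_i)$, and the eligibility-trace vector $\alpha_l^{[t-1]}(P)$ can be recursively updated as follows:

$$\alpha_l^{[t]}(P) = \lambda_l \, \alpha_l^{[t-1]}(P) + g^{[t-d+1]}(P). \tag{130}$$

Here, we use $\theta \equiv (\mathbf{b}, \mathbf{W}^{[1]}, \ldots, \mathbf{W}^{[d-1]}, \mathbf{U}^{[1]}, \ldots, \mathbf{U}^{[L]})$ to collectively denote the parameters.

While $g^{[s]}(p_i)$ for $i \in [1, M]$ is not observed, Kajino uses a maximum a posteriori (MAP) estimator $\hat{g}^{[s]}(p_i)$ in [17]:

$$\hat{g}^{[s]}(p_i) = \mu_\theta^{[s]}(p_i) + k(p_i, Z^{[s]}) \, k_{\sigma^2}(Z^{[s]}, Z^{[s]})^{-1} \left( g^{[s]}(Z^{[s]}) - \mu_\theta^{[s]}(Z^{[s]}) \right), \tag{131}$$

where $k_{\sigma^2}(Z^{[s]}, Z^{[s]})$ is an $N_s \times N_s$ matrix with $(i,j)$-th element being $k_{\sigma^2}(z_i^{[s]}, z_j^{[s]})$, and $\mu_\theta^{[s]}(Z^{[s]})$ is a column vector defined analogously to $g^{[s]}(Z^{[s]})$.

The objective of learning a functional DyBM is to maximize the log likelihood of observed values. The conditional probability density of the functional values at locations $Z^{[t]}$ at time $t$ is given by

$$p_\theta(g^{[t]}(Z^{[t]}) \mid g^{[<t]}) \propto \exp\left( -\frac{1}{2} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right)^\top k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \right). \tag{132}$$

The objective is thus to maximize

$$f(\theta) \equiv \sum_t f_t(\theta), \tag{133}$$

where

$$f_t(\theta) \equiv \log p_\theta(g^{[t]}(Z^{[t]}) \mid g^{[<t]}) \tag{134}$$

$$= -\frac{1}{2} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right)^\top k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) + C, \tag{135}$$

where $C$ is the term independent of $\theta$. The gradient of $f_t(\theta)$ is given by

$$\nabla f_t(\theta) = \nabla \mu_\theta^{[t]}(Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right), \tag{136}$$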

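where (129) gives

$$\frac{\partial}{\partial b_i} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \tag{137}$$

$$\frac{\partial}{\partial w_{i,j}^{[\delta]}} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \, g^{[t-\delta]}(p_j) \tag{138}$$

$$\frac{\partial}{\partial u_{i,j}^{[l]}} \mu_\theta^{[t]}(Z^{[t]}) = k_{\sigma^2}(p_i, Z^{[t]}) \, \alpha_l^{[t-1]}(p_j) \tag{139}$$

for each $i, j, \delta, l$. The gradient implies the following learning rule with stochastic gradient:

$$\mathbf{b} \leftarrow \mathbf{b} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \tag{140}$$

$$\mathbf{W}^{[\delta]} \leftarrow \mathbf{W}^{[\delta]} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) g^{[t-\delta]}(P)^\top \tag{141}$$

$$\mathbf{U}^{[l]} \leftarrow \mathbf{U}^{[l]} + \eta \, k_{\sigma^2}(P, Z^{[t]}) \, k_{\sigma^2}(Z^{[t]}, Z^{[t]})^{-1} \left( g^{[t]}(Z^{[t]}) - \mu_\theta^{[t]}(Z^{[t]}) \right) \alpha_l^{[t-1]}(P)^\top, \tag{142}$$

where $\eta$ is a learning rate, and $g^{[t-\delta]}(P)$ is estimated with the MAP estimator $\hat{g}^{[t-\delta]}(P)$.

5.5 Dynamic Boltzmann machines with hidden units

This section closely follows [30].

In this section, we study a DyBM with hidden units (see Figure 10). Each layer of this DyBM corresponds to a time $t - \delta$ for $0 \le \delta < \infty$ and has two parts: visible and hidden. The visible part $\mathbf{x}^{[t-\delta]}$ at the $\delta$-th layer represents the values of the time-series at time $t - \delta$. The hidden part $\mathbf{h}^{[t-\delta]}$ represents the values of hidden units at time $t - \delta$. Here, units within each layer do not have connections to each other. We let $\mathbf{x}^{[<t]} \equiv (\mathbf{x}^{[s]})_{s<t}$ and define $\mathbf{h}^{[<t]}$ analogously.

Figure 10: A dynamic Boltzmann machine with hidden units (modified Figure 1 from [30]). The diagram shows visible units $\mathbf{x}^{[t-\delta]}, \ldots, \mathbf{x}^{[t-1]}, \mathbf{x}^{[t]}$ and hidden units $\mathbf{h}^{[t-\delta]}, \ldots, \mathbf{h}^{[t-1]}, \mathbf{h}^{[t]}$, with bias $\mathbf{b}$ and weights $\mathbf{U}^{[\delta]}$, $\mathbf{V}^{[\delta]}$, $\mathbf{W}^{[\delta]}$, $\mathbf{Z}^{[\delta]}$.

The Boltzmann machine in Figure 10 has bias parameter $\mathbf{b}$ and weight parameter $(\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{Z})$. Let $\theta \equiv (\mathbf{V}, \mathbf{W}, \mathbf{b})$ be the parameters connected to visible units $\mathbf{x}^{[t]}$ (from the units in the past, $\mathbf{x}^{[s]}$ or $\mathbf{h}^{[s]}$ for $s < t$) and $\phi \equiv (\mathbf{U}, \mathbf{Z})$. The conditional energy of this Boltzmann machine is given as follows:

$$E_{\theta,\phi}(\mathbf{x}^{[t]}, \mathbf{h}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}) = E_\theta(\mathbf{x}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}) + E_\phi(\mathbf{h}^{[t]} \mid \mathbf{x}^{[<t]}, \mathbf{h}^{[<t]}), \tag{143}$$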


More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen Hopfeld networks and Boltzmann machnes Geoffrey Hnton et al. Presented by Tambet Matsen 18.11.2014 Hopfeld network Bnary unts Symmetrcal connectons http://www.nnwj.de/hopfeld-net.html Energy functon The

More information

Effects of Ignoring Correlations When Computing Sample Chi-Square. John W. Fowler February 26, 2012

Effects of Ignoring Correlations When Computing Sample Chi-Square. John W. Fowler February 26, 2012 Effects of Ignorng Correlatons When Computng Sample Ch-Square John W. Fowler February 6, 0 It can happen that ch-square must be computed for a sample whose elements are correlated to an unknown extent.

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM

ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM An elastc wave s a deformaton of the body that travels throughout the body n all drectons. We can examne the deformaton over a perod of tme by fxng our look

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

Applied Stochastic Processes

Applied Stochastic Processes STAT455/855 Fall 23 Appled Stochastc Processes Fnal Exam, Bref Solutons 1. (15 marks) (a) (7 marks) The dstrbuton of Y s gven by ( ) ( ) y 2 1 5 P (Y y) for y 2, 3,... The above follows because each of

More information

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal

More information

Lecture 21: Numerical methods for pricing American type derivatives

Lecture 21: Numerical methods for pricing American type derivatives Lecture 21: Numercal methods for prcng Amercan type dervatves Xaoguang Wang STAT 598W Aprl 10th, 2014 (STAT 598W) Lecture 21 1 / 26 Outlne 1 Fnte Dfference Method Explct Method Penalty Method (STAT 598W)

More information

The Feynman path integral

The Feynman path integral The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity Week3, Chapter 4 Moton n Two Dmensons Lecture Quz A partcle confned to moton along the x axs moves wth constant acceleraton from x =.0 m to x = 8.0 m durng a 1-s tme nterval. The velocty of the partcle

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata Multlayer Perceptrons and Informatcs CG: Lecture 6 Mrella Lapata School of Informatcs Unversty of Ednburgh mlap@nf.ed.ac.uk Readng: Kevn Gurney s Introducton to Neural Networks, Chapters 5 6.5 January,

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information

Open Systems: Chemical Potential and Partial Molar Quantities Chemical Potential

Open Systems: Chemical Potential and Partial Molar Quantities Chemical Potential Open Systems: Chemcal Potental and Partal Molar Quanttes Chemcal Potental For closed systems, we have derved the followng relatonshps: du = TdS pdv dh = TdS + Vdp da = SdT pdv dg = VdP SdT For open systems,

More information

Lecture 17 : Stochastic Processes II

Lecture 17 : Stochastic Processes II : Stochastc Processes II 1 Contnuous-tme stochastc process So far we have studed dscrete-tme stochastc processes. We studed the concept of Makov chans and martngales, tme seres analyss, and regresson analyss

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14 APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Module 2. Random Processes. Version 2 ECE IIT, Kharagpur

Module 2. Random Processes. Version 2 ECE IIT, Kharagpur Module Random Processes Lesson 6 Functons of Random Varables After readng ths lesson, ou wll learn about cdf of functon of a random varable. Formula for determnng the pdf of a random varable. Let, X be

More information

ENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition

ENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition EG 880/988 - Specal opcs n Computer Engneerng: Pattern Recognton Memoral Unversty of ewfoundland Pattern Recognton Lecture 7 May 3, 006 http://wwwengrmunca/~charlesr Offce Hours: uesdays hursdays 8:30-9:30

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

LECTURE 9 CANONICAL CORRELATION ANALYSIS

LECTURE 9 CANONICAL CORRELATION ANALYSIS LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data Condtonal Random Felds: Probablstc Models for Segmentng and Labelng Sequence Data Paper by John Lafferty, Andrew McCallum, and Fernando Perera ICML 2001 Presentaton by Joe Drsh May 9, 2002 Man Goals Present

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan Wnter 2008 CS567 Stochastc Lnear/Integer Programmng Guest Lecturer: Xu, Huan Class 2: More Modelng Examples 1 Capacty Expanson Capacty expanson models optmal choces of the tmng and levels of nvestments

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Mamum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models for

More information

Lecture 14: Bandits with Budget Constraints

Lecture 14: Bandits with Budget Constraints IEOR 8100-001: Learnng and Optmzaton for Sequental Decson Makng 03/07/16 Lecture 14: andts wth udget Constrants Instructor: Shpra Agrawal Scrbed by: Zhpeng Lu 1 Problem defnton In the regular Mult-armed

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information