arXiv: v2 [cs.LG] 16 Sep 2009


Minimum Probability Flow Learning

Jascha Sohl-Dickstein (a, d, 1), Peter Battaglino (b, d, 2) and Michael R. DeWeese (b, c, d, 3)

(a) Biophysics Graduate Group, (b) Department of Physics, (c) Helen Wills Neuroscience Institute, (d) Redwood Center for Theoretical Neuroscience, University of California, Berkeley, 94720

(1) jascha@berkeley.edu, (2) pbb@berkeley.edu, (3) deweese@berkeley.edu

These authors contributed equally.

Abstract

Learning in probabilistic models is often hampered by the general intractability of the normalization factor and its derivatives. Here we propose a new learning technique that obviates the need to compute an intractable normalization factor or sample from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the initial flow of probability away from the data distribution. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate the application of minimum probability flow learning to parameter estimation in Ising models, deep belief networks, multivariate Gaussian distributions and a continuous model with a highly general energy function defined as a power series. In the Ising model case, minimum probability flow learning outperforms current state of the art techniques by approximately two orders of magnitude in learning time, with comparable error in recovered parameters. It is our hope that this technique will alleviate existing restrictions on the classes of probabilistic models that are practical for use.

1 Introduction

Estimating parameters for probabilistic models is a fundamental problem in many scientific and engineering disciplines. Unfortunately, most probabilistic learning techniques require calculating the normalization factor, or partition function, of the probabilistic model in question, or at least calculating its gradient.
For the overwhelming majority of models there are no known analytic solutions, confining us to the highly restrictive subset of probabilistic models that can be analytically solved, or those that can be made tractable using known approximate learning techniques. Thus, development of new techniques for parameter estimation in currently intractable probabilistic models has the potential to be of great benefit, lifting near ubiquitous restrictions on how we are able to model the world.

Many approaches exist for approximate learning, including mean field theory and its expansions, variational Bayes techniques and a plethora of sampling or numerical integration based methods [22, 10, 9, 5]. Of particular interest are contrastive divergence (CD), developed by Welling, Hinton and Carreira-Perpiñán [23, 4], Hyvärinen's score matching (SM) [7], and the minimum velocity learning framework proposed by Movellan [14, 13, 15].

Contrastive divergence [23, 4] is a variation on steepest gradient descent of the maximum (log) likelihood (ML) objective function (see footnote 1). Rather than integrating over the full model distribution, CD approximates the partition function term in the gradient by averaging over the distribution realized after taking a few Markov chain Monte Carlo (MCMC) steps away from the data distribution. Qualitatively, one can imagine that the data distribution is contrasted against a distribution which has evolved a small distance towards the model distribution, whereas it would usually be contrasted against the true model distribution. Although CD is not guaranteed to converge to the right answer, or even to a fixed point, it has proven to be an effective and fast heuristic for parameter estimation [11, 24].

Score matching, developed by Aapo Hyvärinen [7], is a method that learns parameters in a probabilistic model using only derivatives of the energy function evaluated over the data distribution (see Equation (12)). This sidesteps the need to explicitly sample or integrate over the model distribution. In score matching one minimizes the expected square distance of the score function with respect to spatial coordinates given by the data distribution from the similar score function given by the model distribution. It can be seen as an integration of the contrastive divergence gradient for infinitesimal Langevin dynamics [8], as the limit of approximating the model distribution by patching together cutouts of the model distribution around each data point [21], and finally as equivalent to minimum velocity learning [14].

Minimum velocity learning is an approach recently proposed by Movellan [14] that recasts a number of the ideas behind CD, treating the minimization of the initial dynamics away from the data distribution as the goal itself rather than a surrogate for it. Movellan's proposal is that rather than directly minimize the difference between the data and the model, one introduces system dynamics that have the model as their equilibrium distribution, and minimizes the initial flow of probability away from the data under those dynamics. If the model looks exactly like the data there will be no flow of probability, and if model and data are similar the flow of probability will tend to be minimal. Movellan applies this intuition to the specific case of distributions over continuous state spaces evolving via diffusion dynamics.
The velocity in minimum velocity learning is the difference in average drift velocities between particles diffusing under the model distribution and particles diffusing under the data distribution.

Here we provide a framework, applicable to any parametric model, of which minimum velocity, certain forms of CD, and SM are all special cases, and which is in many situations more powerful than any of these algorithms. This framework extends the ideas behind minimum velocity learning to arbitrary state spaces and a far broader class of dynamics. We show that learning under this framework is effective and fast in a number of cases: Ising models, deep belief networks (DBN), multidimensional Gaussian distributions, and a complicated two-dimensional continuous distribution.

Footnote 1: The update rule for gradient descent of the negative log likelihood, or maximum likelihood objective function, is

$$\Delta\theta \propto \frac{\partial}{\partial\theta}\left[\sum_i p_i^{(0)} \log p_i^{(\infty)}(\theta)\right] = -\sum_i p_i^{(0)} \frac{\partial E_i(\theta)}{\partial\theta} + \sum_i p_i^{(\infty)}(\theta) \frac{\partial E_i(\theta)}{\partial\theta},$$

where $p^{(0)}$ and $p^{(\infty)}(\theta)$ represent the data distribution and model distribution, respectively, $E_i(\theta)$ is the energy function associated with the model distribution and $i$ indexes the states of the system (see Section 2.1). The second term in this gradient can be extremely difficult to compute (costing in general an amount of time exponential in the dimensionality of the system). Under contrastive divergence $p^{(\infty)}(\theta)$ is replaced by samples only a few Monte Carlo steps away from the data.

2 Minimum probability flow

Our goal is to find the parameters that cause a probabilistic model to best agree with a set of (assumed iid) observations of the state of a system. We will do this by proposing dynamics that guarantee the transformation of the data distribution into the model distribution, and then minimizing the magnitude of the initial flow of probability away from the data distribution.

2.1 Distributions

The data distribution is represented by a vector $p^{(0)}$, with $p_i^{(0)}$ the probability of observing the system in a state $i$. The superscript $(0)$ represents time $t = 0$ under the system dynamics, as will be described in more detail in Section 2.2. If the observations were of a two variable binary system, then $p^{(0)}$ would have four entries representing the probabilities of observing states 00, 01, 10 and 11.

Figure 1: Dynamics of minimum probability flow learning. Panels: data distribution $p^{(0)}$ with $\dot{p}^{(0)} = \Gamma(\theta)\, p^{(0)}$ (left); dynamics $\dot{p}^{(t)} = \Gamma(\theta)\, p^{(t)}$ (middle); model distribution $p^{(\infty)}(\theta) = e^{-E(\theta)} / Z(\theta)$ with $\dot{p}^{(\infty)} = 0$ (right). Model dynamics represented by the probability flow matrix $\Gamma$ (middle) determine how probability flows from the empirical histogram of the sample data points (left) to the equilibrium distribution of the model (right) after a sufficiently long time. In this example there are only four possible states for the system, which consists of a pair of binary variables, and the particular model parameters favor one state whereas the data falls mostly on other states.

Our goal is to find the parameters $\theta$ that cause a model distribution $p^{(\infty)}(\theta)$ to best match the data distribution $p^{(0)}$. Without loss of generality, we assume the model distribution to be of the form

$$p_i^{(\infty)}(\theta) = \frac{\exp\left(-E_i(\theta)\right)}{Z(\theta)}, \tag{1}$$

where $E_i(\theta)$ is referred to as the energy function, and the normalizing factor $Z(\theta)$ is called the partition function,

$$Z(\theta) = \sum_i \exp\left(-E_i(\theta)\right) \tag{2}$$

(here we have set the temperature of the system to 1). The superscript $(\infty)$ indicates that this is the equilibrium distribution reached after running the dynamics for infinite time.

2.2 Dynamics

We wish to generalize Movellan's diffusion dynamics to arbitrary state spaces. To accomplish this, we observe that diffusion dynamics are a special case of dynamics governed by a master equation that enforces conservation of probability [16]:

$$\dot{p}_i^{(t)} = \sum_{j \neq i} \Gamma_{ij}(\theta)\, p_j^{(t)} - \sum_{j \neq i} \Gamma_{ji}(\theta)\, p_i^{(t)}, \tag{3}$$

where $\dot{p}_i^{(t)} = \partial p_i^{(t)} / \partial t$ is the rate of change of probability of state $i$ with time. Transition rates $\Gamma_{ij}(\theta)$ give the rate at which probability will flow from a state $j$ into a state $i$. The first term of Equation (3) represents flow of probability out of other states $j$ into the state $i$, and the second represents flow out of $i$ into other states $j$. The dependence on $\theta$ results from the requirement that the dynamics we choose cause $p^{(0)}$ to flow to the equilibrium distribution $p^{(\infty)}(\theta)$.
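The conservation property of the master equation in Equation (3) can be checked numerically. The following is a minimal sketch, not the authors' code: the three-state rate table, the step size and the forward-Euler integrator are illustrative assumptions chosen only to show that inflow and outflow cancel in total.

```python
def master_equation_rhs(rates, p):
    """Right-hand side of Eq. (3): inflow minus outflow for every state.

    rates[i][j] is the transition rate from state j into state i (i != j);
    diagonal entries are ignored.
    """
    n = len(p)
    dp = []
    for i in range(n):
        inflow = sum(rates[i][j] * p[j] for j in range(n) if j != i)
        outflow = sum(rates[j][i] * p[i] for j in range(n) if j != i)
        dp.append(inflow - outflow)
    return dp


def euler_step(rates, p, dt):
    """One forward-Euler step of the dynamics (an assumed integrator)."""
    dp = master_equation_rhs(rates, p)
    return [pi + dt * dpi for pi, dpi in zip(p, dp)]


# Hypothetical transition rates for a three-state system.
rates = [[0.0, 0.5, 0.2],
         [1.0, 0.0, 0.3],
         [0.4, 0.7, 0.0]]

p = [1.0, 0.0, 0.0]          # all probability starts in state 0
for _ in range(1000):
    p = euler_step(rates, p, dt=0.01)
total = sum(p)               # stays at 1 up to rounding error
```

Because every unit of probability leaving one state enters another, the entries of the right-hand side sum to zero for any distribution, so the total probability is unchanged by the dynamics.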
For readability, explicit dependence on $\theta$ will be dropped except where specifically relevant. If we choose the diagonal of $\Gamma$ to obey $\Gamma_{ii} = -\sum_{j \neq i} \Gamma_{ji}$, then we can write the dynamics as

$$\dot{p}^{(t)} = \Gamma\, p^{(t)} \tag{4}$$

(see Figure 1). The unique solution for $p^{(t)}$ is (see footnote 2)

$$p^{(t)} = \exp\left(\Gamma t\right) p^{(0)}. \tag{5}$$

Footnote 2: The form chosen for $\Gamma$ in Equation (4), coupled with the satisfaction of detailed balance and ergodicity introduced in Section 2.3, guarantees that there is a unique eigenvector $p^{(\infty)}$ of $\Gamma$ with eigenvalue zero, and that all other eigenvalues of $\Gamma$ have negative real parts.

2.3 Detailed Balance

$\Gamma$ must be chosen such that the dynamics in Equation (4) converge to the model distribution. One way to guarantee this is by choosing $\Gamma$ such that it satisfies detailed balance for the model distribution $p^{(\infty)}$, and such that there is a path through $\Gamma$ allowing mixing between any two states. Note that there is no need to restrict the dynamics defined by $\Gamma$ to those of any real physical process, such as diffusion.

Detailed balance requires that at equilibrium the probability flow from state $i$ into state $j$ equals the probability flow from $j$ into $i$,

$$\Gamma_{ji}\, p_i^{(\infty)}(\theta) = \Gamma_{ij}\, p_j^{(\infty)}(\theta), \tag{6}$$

which can be rewritten as

$$\frac{\Gamma_{ij}}{\Gamma_{ji}} = \frac{p_i^{(\infty)}(\theta)}{p_j^{(\infty)}(\theta)} = \exp\left[E_j(\theta) - E_i(\theta)\right]. \tag{7}$$

$\Gamma$ is underconstrained by the above equation. Motivated by symmetry and aesthetics, we choose as the form for the (non-zero, non-diagonal) entries in $\Gamma$

$$\Gamma_{ij} = \exp\left[\frac{1}{2}\left(E_j(\theta) - E_i(\theta)\right)\right] \quad (i \neq j). \tag{8}$$

The choice $\Gamma_{ij} = \Gamma_{ji} = 0$ also satisfies Equation (6), allowing a sparse population of $\Gamma$ for purposes of computational tractability. Theoretically, to guarantee convergence to the model distribution, the non-zero elements of $\Gamma$ must be chosen such that, given sufficient time, probability can flow between any pair of states. In practice, we will only need to consider a small fraction of the non-zero elements in $\Gamma$ (see Section 2.5).

2.4 Objective Function

The goal is to minimize the initial flow of probability away from the data distribution (Figure 2). Although other objective functions are possible for a minimum probability flow approach, we have found the $L_1$ norm to be particularly effective:

$$\hat{\theta} = \arg\min_\theta K(\theta), \tag{9}$$

$$K(\theta) = \left\| \dot{p}^{(0)}(\theta) \right\|_1 = \sum_i \left| \dot{p}_i^{(0)}(\theta) \right|. \tag{10}$$

This objective function is uniquely zero when $p^{(0)}$ and $p^{(\infty)}(\theta)$ are exactly equal (although in general the relationship of $\hat{\theta}$ to the maximum likelihood solution is less clear). Some algebra gives the learning gradient with respect to $\theta$:

$$\frac{\partial K}{\partial \theta} = \frac{1}{2} \sum_{i,j} \Gamma_{ij}(\theta)\, p_j^{(0)} \left[ \frac{\partial E_j(\theta)}{\partial \theta} - \frac{\partial E_i(\theta)}{\partial \theta} \right] \left[ \mathrm{sgn}\left( \dot{p}_i^{(0)}(\theta) \right) - \mathrm{sgn}\left( \dot{p}_j^{(0)}(\theta) \right) \right]. \tag{11}$$

Note that Equations (9) through (11) do not depend on the partition function $Z(\theta)$ or its derivatives.

Under the constraint that $\Gamma$ does not allow probability to flow directly from one state with data to another (nearly always satisfied when the number of system states is much larger than the number of states with data), Equation (9) is equivalent to minimizing the initial rate of growth of the KL divergence between $p^{(0)}$ and $p^{(t)}$, $\left. \partial D_{KL}\left(p^{(0)} \| p^{(t)}\right) / \partial t \right|_{t=0}$ (see Appendix A). Under the same constraint, the minimum probability flow objective function $K(\theta)$ is convex for all models $p^{(\infty)}(\theta)$ in the exponential family, that is, models whose energy functions $E(\theta)$ are linear in their parameters $\theta$ [12] (see Appendix B).
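To make Equations (4), (6), (8) and (10) concrete, here is a small sketch for a four-state system such as the pair of binary variables in Figure 1. It is an illustration under stated assumptions rather than the authors' implementation: the energies and the data histogram are made-up values, and $\Gamma$ is stored densely, which is only feasible for tiny state spaces.

```python
import math


def model_distribution(energies):
    """Equilibrium distribution of Eq. (1): p_i = exp(-E_i) / Z."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def flow_matrix(energies):
    """Dense Gamma: off-diagonal entries from Eq. (8), columns summing to zero."""
    n = len(energies)
    gamma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # rate of flow from state j into state i
                gamma[i][j] = math.exp(0.5 * (energies[j] - energies[i]))
    for i in range(n):
        gamma[i][i] = -sum(gamma[j][i] for j in range(n) if j != i)
    return gamma


def initial_flow(gamma, p0):
    """Eq. (4) evaluated at t = 0: the vector Gamma p0."""
    n = len(p0)
    return [sum(gamma[i][j] * p0[j] for j in range(n)) for i in range(n)]


def objective(energies, p0):
    """Eq. (10): L1 norm of the initial probability flow."""
    return sum(abs(v) for v in initial_flow(flow_matrix(energies), p0))


energies = [0.0, 1.0, 2.0, 0.5]      # hypothetical energies for states 00, 01, 10, 11
p_model = model_distribution(energies)
p_data = [0.1, 0.4, 0.4, 0.1]        # hypothetical data histogram

k_at_data = objective(energies, p_data)    # positive: data and model disagree
k_at_model = objective(energies, p_model)  # zero up to rounding error
```

The objective vanishes when the data histogram equals the model distribution because detailed balance makes every pairwise flow cancel, and at no point is the partition function needed to build $\Gamma$.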

Figure 2: An illustration of the minimum probability flow objective function, which minimizes the initial flow of probability away from the data. The horizontal axis of each panel shows the states of the system. a. Empirical histogram of the observed data over all possible states of the system. b. Model distribution that the dynamics would converge to if allowed to run for a sufficiently long time. The dynamics and model distribution are both functions of the model parameters ($\theta$). Our goal is to make the model, or equilibrium, distribution as much like the data distribution as possible. c. The distribution after starting at the data and running the dynamics for a short time period. d. The temporal derivative of the probability distribution, or probability flow, at $t = 0$. Learning is achieved by changing the model parameters so as to minimize the shaded region of this graph.

2.5 Tractability

The vector $p^{(0)}$ is typically huge, as is $\Gamma$ (e.g., $2^N$ and $2^N \times 2^N$, respectively, for an $N$-bit binary system). Naïvely, this would seem to prohibit evaluation and minimization of the objective function. Fortunately, all the elements in $p^{(0)}$ not corresponding to observations are zero. Since our objective function is only evaluated at time $t = 0$, this allows us to ignore all those $\Gamma_{ij}$ for which no data point exists at state $j$. Additionally, there is a great deal of flexibility as far as which elements of $\Gamma$ can be set to zero. By populating $\Gamma$ so as to connect each state to a small fixed number of additional states, the cost of the algorithm in both memory and time is $O(M)$, where $M$ is the number of observed data points, and does not depend on the number of system states.

2.6 Continuous Systems

Although we have motivated this technique using systems with a large, but finite, number of states, it generalizes in a straightforward manner to continuous systems. The flow matrix $\Gamma$ and distribution vectors transition from being very large to being infinite in size. $\Gamma$ can still be chosen to connect each state to a small, finite number of additional states, however, and only outgoing probability flow from states with data contributes to the objective function, so the cost of learning remains largely unchanged.

In addition, for a particular pattern of connectivity in $\Gamma$ this objective function, like Movellan's [14], reduces to score matching [7] (other connectivity patterns reduce to alternate forms). Taking the limit of connections between all states within a small distance $\epsilon$ of each other, and then Taylor expanding in $\epsilon$, one can show that, up to an overall constant and scaling factor,

$$K \sim K_{SM} = \sum_{x \in \{\text{samples}\}} \left[ \frac{1}{2} \nabla E(x) \cdot \nabla E(x) - \nabla^2 E(x) \right]. \tag{12}$$
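Returning to the tractability argument of Section 2.5, the bookkeeping can be sketched for a binary system whose connectivity is restricted to single bit flips. This is a hedged illustration, not the authors' Matlab code: the independent-bit energy, the helper names and the toy data are hypothetical, and the returned value is the flow out of the data states, which is proportional to $K$ only under the constraint that data states are not connected to one another.

```python
import math


def bit_flip_neighbors(x):
    """All states that differ from the tuple x in exactly one bit."""
    return [x[:k] + (1 - x[k],) + x[k + 1:] for k in range(len(x))]


def sparse_flow_objective(energy, data):
    """Probability flow out of the observed states, averaged over samples.

    Only Gamma entries whose source state is a data point are visited, with
    Gamma_ij = exp((E_j - E_i) / 2) as in Eq. (8). Neighbors that are
    themselves data points are skipped (an assumption of this sketch).
    """
    observed = set(data)
    total = 0.0
    for x in data:                        # M samples
        e_x = energy(x)
        for y in bit_flip_neighbors(x):   # N single-bit-flip neighbors each
            if y not in observed:
                total += math.exp(0.5 * (e_x - energy(y)))
    return total / len(data)


def make_energy(bias):
    """Hypothetical independent-bit energy E(x) = sum_k bias_k * x_k."""
    return lambda x: sum(b * xk for b, xk in zip(bias, x))


data = [(1, 0, 0, 0, 0, 0), (1, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0)]

# Parameters that assign low energy to the observed states give a small flow.
k_good = sparse_flow_objective(make_energy([-2.0, 0.0, 2.0, 2.0, 2.0, 2.0]), data)
# Parameters that assign them high energy give a large flow.
k_bad = sparse_flow_objective(make_energy([2.0, 0.0, -2.0, -2.0, -2.0, -2.0]), data)
```

The loop touches $M \cdot N$ entries of $\Gamma$ rather than $2^N \times 2^N$, so the cost scales with the number of samples and the chosen number of neighbors per sample, not with the size of the state space.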


Equation (12) reproduces the link discovered by Movellan [14] between diffusion dynamics over continuous spaces and score matching.

3 Experimental Results

Matlab code implementing minimum probability flow learning for each of the following cases is available upon request. A public toolkit is under construction. All minimization was performed using Mark Schmidt's remarkably effective minFunc [17].

3.1 Ising model

The Ising model has a long and storied history in physics [3] and machine learning [1] and it has recently been found to be a surprisingly useful model for networks of neurons in the retina [18, 20]. The ability to fit Ising models to the activity of large groups of simultaneously recorded neurons is of current interest given the increasing number of these types of data sets from the retina, cortex and other brain structures.

We fit an Ising model (fully visible Boltzmann machine) of the form

$$p^{(\infty)}(x; J) = \frac{1}{Z(J)} \exp\left[ -\sum_{i,j} J_{ij}\, x_i x_j \right] \tag{13}$$

to a set of $N$ $d$-element iid data samples $\{ x^{(i)} \mid i = 1 \ldots N \}$ generated via Gibbs sampling from an Ising model as described below, where each of the $d$ elements of $x$ is either 0 or 1. Because each $x_j \in \{0, 1\}$, $x_j^2 = x_j$, we can write the energy function as

$$E(x; J) = \sum_{i,\, j \neq i} J_{ij}\, x_i x_j + \sum_j J_{jj}\, x_j. \tag{14}$$

The probability flow matrix $\Gamma$ has $2^d \times 2^d$ elements, but we allow only elements corresponding to transitions into states a single bit-flip away to be non-zero. Figure 3 shows the average error in predicted correlations as a function of learning time for 20,000 samples from a 40 unit, fully connected Ising model. The $J_{ij}$ used were graciously provided by Broderick and coauthors, and were identical to those used for synthetic data generation in the 2008 paper "Faster solutions of the inverse pairwise Ising problem" [2]. Training was performed on 20,000 samples so as to match the number of samples used in section III.A. of Broderick et al. Note that given sufficient samples, the minimum probability flow algorithm would converge exactly to the right answer, as learning in the Ising model is convex (Appendix B), and has its global minimum at the true solution. On an 8 core 2.33 GHz Intel Xeon, the learning converges in about 15 seconds. Broderick et al. perform a similar learning task on a 100-CPU grid computing cluster, with a convergence time of approximately 200 seconds.

Figure 3: A demonstration of rapid fitting of the Ising model by minimum probability flow learning. Each panel plots mean absolute correlation error against time (sec). The mean absolute error in the learned model's correlation matrix is shown as a function of learning time for 40 and 100 unit fully connected Ising models. Convergence is reached in about 15 seconds for 20,000 samples from the 40 unit model (left) and in about 1 minute for 100,000 samples from the 100 unit model (right). Details of the 100 unit model can be seen in Figure 4.

Similar learning was performed for 100,000 samples from a 100 unit, fully connected, Ising model. A coupling matrix was chosen with elements randomly drawn from a Gaussian with mean 0 and variance 0.04. Using the minimum probability flow learning technique, learning took approximately 1 minute, compared to roughly 12 hours for a 100 unit (nearest neighbor coupling only) model of retinal data [19] (personal communication, J. Shlens). Figure 4 demonstrates the recovery of the coupling and correlation matrices for our fully connected Ising model, while Figure 3 shows the time course for learning.

Figure 4: An example 100 unit Ising model fit using minimum probability flow learning. Panels show $J$, $J_{new}$ and $J - J_{new}$ (top) and $C$, $C_{new}$ and $C - C_{new}$ (bottom). (left) Randomly chosen Gaussian coupling matrix $J$ (top) with variance 0.04 and associated correlation matrix $C$ (bottom) for a 100 unit, fully-connected Ising model. The diagonal has been removed from the correlation matrix $C$ for increased visibility. (center) The recovered coupling and correlation matrices after minimum probability flow learning on 100,000 samples from the model in the left panels. (right) The error in recovery of the coupling and correlation matrices.

3.2 Deep Belief Network

As a demonstration of learning on a more complex discrete valued model, we trained a 4 layer deep belief network (DBN) [6] on MNIST handwritten digits. A DBN consists of stacked restricted Boltzmann machines (RBMs), such that the hidden layer of one RBM forms the visible layer of the next. Each RBM has the form:

$$p^{(\infty)}(x_{vis}, x_{hid}; W) = \frac{1}{Z(W)} \exp\left[ -\sum_{i,j} W_{ij}\, x_{vis,i}\, x_{hid,j} \right], \tag{15}$$

$$p^{(\infty)}(x_{vis}; W) = \frac{1}{Z(W)} \exp\left( \sum_j \log\left[ 1 + \exp\left( -\sum_i W_{ij}\, x_{vis,i} \right) \right] \right). \tag{16}$$

Note that sampling-free application of the minimum probability flow algorithm requires analytically marginalizing over the hidden units. RBMs were trained in sequence, starting at the bottom layer, on samples from the MNIST postal handwritten digits data set. As in the Ising case, the probability flow matrix $\Gamma$ was populated so as to connect every state to all states which differed by only a single bit flip. Training was performed by both minimum probability flow and single step CD to allow a simple comparison of the two techniques (note that CD turns into full ML learning as the number of steps is increased, and that the quality of the CD answer can thus be improved at the cost of computational time by using many-step CD).

Confabulations were performed by Gibbs sampling from the top layer RBM, then propagating each sample back down to the pixel layer by way of the conditional distribution $p^{(\infty)}(x_{vis} \mid x_{hid}; W_k)$ for each of the intermediary RBMs, where $k$ indexes the layer in the stack. As shown in Figure 5, minimum probability flow learned a good model of handwritten digits.

Figure 5: A deep belief network trained using minimum probability flow learning and contrastive divergence. (left) A four layer deep belief network was trained on the MNIST postal handwritten digits dataset (28x28 pixel input layer). (center) Confabulations after training via minimum probability flow learning. A reasonable probabilistic model for handwritten digits has been learned. (right) Confabulations after training via single step CD. Note the uneven distribution of digit occurrences.

3.3 Gaussian

As an example of minimum probability flow learning applied to continuous models, we fit a multivariate Gaussian distribution to synthetic data. The model distribution has the form

$$p^{(\infty)}(x; \Sigma^{-1}) = \frac{1}{Z(\Sigma^{-1})} \exp\left[ -\frac{1}{2} x^T \Sigma^{-1} x \right], \tag{17}$$

with vector $x$ and coupling matrix $\Sigma^{-1}$. We fit to iid samples from a multidimensional Gaussian distribution. The probability flow matrix $\Gamma$ was populated so as to connect every state to 2 additional states, chosen from a Gaussian distribution centered on the state. Results are shown in Figure 6.

Figure 6: A continuous state space model fit using minimum probability flow learning. (left) Randomly chosen coupling matrix $\Sigma^{-1}$ and associated covariance matrix $\Sigma$ for a multidimensional Gaussian distribution. (center) The recovered coupling matrix $\Sigma^{-1}_{new}$ and associated covariance matrix $\Sigma_{new}$ after minimum probability flow learning on samples from the model in (left). (right) The error in recovery of the coupling and covariance matrices.
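The Gaussian experiment can be mimicked in one dimension with a few lines of code. The sketch below is an assumption-laden toy, not the experiment reported above: it uses a single precision parameter, 4000 synthetic samples, 5 neighbors per sample drawn with standard deviation 0.5, and a plain grid search in place of minFunc. All of these values are illustrative choices.

```python
import math
import random


def energy(x, lam):
    """Energy of a zero-mean 1-D Gaussian with precision lam (cf. Eq. (17))."""
    return 0.5 * lam * x * x


def flow_objective(lam, data, neighbors):
    """Average flow exp((E(x) - E(x')) / 2) from each sample x to its neighbors x'."""
    total = 0.0
    count = 0
    for x, ys in zip(data, neighbors):
        e_x = energy(x, lam)
        for y in ys:
            total += math.exp(0.5 * (e_x - energy(y, lam)))
            count += 1
    return total / count


rng = random.Random(0)
true_precision = 2.0
data = [rng.gauss(0.0, 1.0 / math.sqrt(true_precision)) for _ in range(4000)]

# Connect each sample to a few nearby states, drawn once before optimization
# so that the connectivity of Gamma stays fixed while the parameter changes.
neighbors = [[x + rng.gauss(0.0, 0.5) for _ in range(5)] for x in data]

grid = [0.5 + 0.05 * k for k in range(71)]   # candidate precisions 0.5 .. 4.0
best = min(grid, key=lambda lam: flow_objective(lam, data, neighbors))
```

Drawing the neighbors from a distribution that is symmetric around each sample keeps the connectivity pattern symmetric on average, so the flow out of the data is balanced at the true parameter; the partition function of the Gaussian is never evaluated.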


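Figure 7: A highly unconstrained, difficult to normalize, model fit using minimum probability flow learning. (left) Histogram of a complicated two-dimensional continuous distribution $p^{(0)}(x, y)$, $(x, y) \in [-1, 1]^2$. The probability of observing a sample $(x, y)$ is proportional to the pixel value at location $(x, y)$ in the histogram. Note, the image represents a distribution over $(x, y)$ values, not a sample from a distribution. (center) Scatter plot of samples drawn from the distribution in (left). (right) Histogram of learned distribution $p^{(\infty)}(x, y; \theta)$ trained in batch mode on groups of samples from the distribution in (left), where $p^{(\infty)}(x, y; \theta) \propto \exp\left( -\sum_{m=0}^{28} \sum_{n=0}^{28} \theta_{mn} L_m(x) L_n(y) \right)$ and $L_m(x)$ is the $m$th Legendre polynomial in $x$.

3.4 Power Series Energy Function

To demonstrate minimum probability flow's effectiveness in an extremely flexible, difficult to normalize, model, we learned parameters $\theta$ for a two-dimensional continuous distribution of the form

$$p^{(\infty)}(x, y; \theta) = \frac{1}{Z(\theta)} \exp\left[ -\sum_{m,n=0}^{M} \theta_{mn} L_m(x) L_n(y) \right], \tag{18}$$

where $L_m(x)$ is the $m$th order Legendre polynomial in $x$, $(x, y) \in [-1, 1]^2$, $M$ is the maximum polynomial order, and $Z(\theta)$ is the normalization factor. We fit an $M = 28$ distribution using consecutive line searches on batches of iid samples from the distribution shown on the left of Figure 7. The probability flow matrix $\Gamma$ was populated so as to connect every state with 2 other states, chosen from a uniform distribution in the range $[-1, 1]^2$. Figure 7 shows a histogram of the data distribution $p^{(0)}(x, y)$ compared to a histogram of the learned Legendre function expansion $p^{(\infty)}(x, y; \theta)$.

4 Summary

We have presented a novel framework for efficient learning in the context of any parametric model. This method was inspired by the minimum velocity approach developed by Movellan, and it reduces to that technique as well as to score matching and some forms of contrastive divergence under suitable choices for the dynamics and state space. By decoupling the dynamics from any specific physical process, such as diffusion, and focusing on the initial flow of probability from the data to a subset of other states chosen in part for their utility and convenience, we have arrived at a framework that is not only more general than previous approaches, but also potentially much more powerful. We expect that this framework will render some previously intractable models more amenable to estimation.

Acknowledgments

We would like to thank Javier Movellan for sharing a work in progress; Tamara Broderick, Miroslav Dudík, Gašper Tkačik, Robert E. Schapire and William Bialek for use of their Ising model coupling parameters; Jonathon Shlens for useful discussion and ground truth for his Ising model convergence times; Bruno Olshausen, Anthony Bell, Christopher Hillar, Charles Cadieu, Kilian Koepsell and the rest of the Redwood Center for many useful discussions and for comments on earlier versions of the manuscript; Ashvin Vishwanath for useful discussion; and the Canadian Institute for Advanced Research - Neural Computation and Perception Program for their financial support (JSD).

APPENDICES

A Connection to KL Divergence

We want to measure the rate of growth of the KL divergence at time 0, $\left. \partial D_{KL}\left(p^{(0)} \| p^{(t)}\right) / \partial t \right|_{t=0}$. The KL divergence between the data distribution, $p^{(0)}$, and the distribution resulting after running the dynamics for a time $t$, $p^{(t)}$, is

$$D_{KL}\left(p^{(0)} \| p^{(t)}\right) = \sum_i p_i^{(0)} \log p_i^{(0)} - \sum_i p_i^{(0)} \log p_i^{(t)}. \tag{A-1}$$

Note that the terms for which $p_i^{(0)} = 0$ will never contribute to this sum. To make this explicit, we rewrite the sum as being over the set of states which are non-zero in $p^{(0)}$. This set is

$$D = \left\{ i : p_i^{(0)} \neq 0 \right\}. \tag{A-2}$$

We also note the complement of this set,

$$D^C = \left\{ i : p_i^{(0)} = 0 \right\}. \tag{A-3}$$

This makes the KL divergence

$$D_{KL}\left(p^{(0)} \| p^{(t)}\right) = \sum_{i \in D} p_i^{(0)} \log p_i^{(0)} - \sum_{i \in D} p_i^{(0)} \log p_i^{(t)}. \tag{A-4}$$

The derivative is

$$\frac{\partial D_{KL}}{\partial t} = -\sum_{i \in D} \frac{p_i^{(0)}}{p_i^{(t)}} \frac{\partial p_i^{(t)}}{\partial t} \tag{A-5}$$

$$= -\sum_{i \in D} \frac{p_i^{(0)}}{p_i^{(t)}} \left[ \sum_{j \neq i} \Gamma_{ij}\, p_j^{(t)} - \sum_{j \neq i} \Gamma_{ji}\, p_i^{(t)} \right] \tag{A-6}$$

$$= -\sum_{i \in D} \frac{p_i^{(0)}}{p_i^{(t)}} \sum_{j \neq i} \left[ \Gamma_{ij}\, p_j^{(t)} - \Gamma_{ji}\, p_i^{(t)} \right] \tag{A-7}$$

$$= -\sum_{i \in D} \frac{p_i^{(0)}}{p_i^{(t)}} \left[ \sum_{j \in D,\, j \neq i} \left( \Gamma_{ij}\, p_j^{(t)} - \Gamma_{ji}\, p_i^{(t)} \right) + \sum_{j \in D^C} \left( \Gamma_{ij}\, p_j^{(t)} - \Gamma_{ji}\, p_i^{(t)} \right) \right]. \tag{A-8}$$

In the last line the sum over all $j$ has been broken into a sum over $D$ and its complement $D^C$. We evaluate the derivative at $t = 0$, where $p_i^{(0)} / p_i^{(t)} = 1$ for every $i \in D$:

$$\left. \frac{\partial D_{KL}}{\partial t} \right|_{t=0} = -\sum_{i \in D} \sum_{j \in D,\, j \neq i} \Gamma_{ij}\, p_j^{(0)} + \sum_{i \in D} \sum_{j \in D,\, j \neq i} \Gamma_{ji}\, p_i^{(0)} - \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ij}\, p_j^{(0)} + \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji}\, p_i^{(0)}. \tag{A-9}$$

We can simplify this by noting that the following terms are 0:

$$\sum_{i \in D} \sum_{j \in D,\, j \neq i} \Gamma_{ji}\, p_i^{(0)} - \sum_{i \in D} \sum_{j \in D,\, j \neq i} \Gamma_{ij}\, p_j^{(0)} = 0, \tag{A-10}$$

$$\sum_{i \in D} \sum_{j \in D^C} \Gamma_{ij}\, p_j^{(0)} = 0. \tag{A-11}$$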

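This means that the rate of growth of the KL divergence at the data distribution, $t = 0$, is

$$\left. \frac{\partial D_{KL}}{\partial t} \right|_{t=0} = \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji}\, p_i^{(0)}. \tag{A-12}$$

That is, the rate of growth of the KL divergence is equal to the rate of probability flow from states with data to those without. This is equivalent to the minimum probability flow $L_1$ objective function in the usual case that $\Gamma$ does not allow probability to flow directly from one state with data to another.

B Convexity

As observed by Macke and Gerwinn [12], Equation (A-12) is convex for models in the exponential family. We wish to minimize

$$K = \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji}\, p_i^{(0)}. \tag{B-1}$$

$K$ has derivative

$$\frac{\partial K}{\partial \theta_m} = \sum_{i \in D} \sum_{j \in D^C} \frac{\partial \Gamma_{ji}}{\partial \theta_m}\, p_i^{(0)} \tag{B-2}$$

$$= \frac{1}{2} \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji} \left( \frac{\partial E_i}{\partial \theta_m} - \frac{\partial E_j}{\partial \theta_m} \right) p_i^{(0)}, \tag{B-3}$$

and Hessian

$$\frac{\partial^2 K}{\partial \theta_m \partial \theta_n} = \frac{1}{4} \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji} \left( \frac{\partial E_i}{\partial \theta_m} - \frac{\partial E_j}{\partial \theta_m} \right) \left( \frac{\partial E_i}{\partial \theta_n} - \frac{\partial E_j}{\partial \theta_n} \right) p_i^{(0)} \tag{B-4}$$

$$+ \frac{1}{2} \sum_{i \in D} \sum_{j \in D^C} \Gamma_{ji} \left( \frac{\partial^2 E_i}{\partial \theta_m \partial \theta_n} - \frac{\partial^2 E_j}{\partial \theta_m \partial \theta_n} \right) p_i^{(0)}. \tag{B-5}$$

The first term is a weighted sum of outer products, with non-negative weights $\frac{1}{4} \Gamma_{ji}\, p_i^{(0)}$, and is thus positive semidefinite. The second term is 0 for models in the exponential family (those with energy functions linear in their parameters). Parameter estimation for models in the exponential family is therefore convex using minimum probability flow learning, in the commonly satisfied limit that $\Gamma$ does not directly connect any two data points.

References

[1] D H Ackley, G E Hinton, and T J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(2):147-169, 1985.
[2] T Broderick, M Dudík, G Tkačik, R Schapire, and W Bialek. Faster solutions of the inverse pairwise Ising problem. E-print arXiv, Jan 2007.
[3] S G Brush. History of the Lenz-Ising model. Reviews of Modern Physics, 39(4), Oct 1967.
[4] M A Carreira-Perpiñán and G E Hinton. On contrastive divergence (CD) learning. Technical report, Dept. of Computer Science, University of Toronto, 2004.
[5] S Haykin. Neural networks and learning machines; 3rd edition. Prentice Hall, 2008.
[6] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), Jul 2006.
[7] A Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695-709, 2005.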

[8] A Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, Jan 2007.
[9] T Jaakkola and M Jordan. A variational approach to Bayesian logistic regression models and their extensions. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, Jan 1997.
[10] H Kappen and F Rodríguez. Mean field approach to learning in Boltzmann machines. Pattern Recognition Letters, Jan 1997.
[11] D MacKay. Failures of the one-step learning algorithm. Jan 2001.
[12] J Macke and S Gerwinn. Personal communication. 2009.
[13] J R Movellan. Contrastive divergence in Gaussian diffusions. Neural Computation, 20(9), 2008.
[14] J R Movellan. A minimum velocity approach to learning. Unpublished draft, Jan 2008.
[15] J R Movellan and J L McClelland. Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17, 1993.
[16] R Pathria. Statistical Mechanics. Butterworth Heinemann, Jan 1972.
[17] M Schmidt. minFunc. schmidtm/software/minfunc.html, 2005.
[18] E Schneidman, M J Berry 2nd, R Segev, and W Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007-1012, 2006.
[19] J Shlens, G D Field, J L Gauthier, M Greschner, A Sher, A M Litke, and E J Chichilnisky. The structure of large-scale synchronized firing in primate retina. Journal of Neuroscience, 29(15):5022-5031, Apr 2009.
[20] J Shlens, G D Field, J L Gauthier, M I Grivich, D Petrusca, A Sher, A M Litke, and E J Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J. Neurosci., 26(32), 2006.
[21] J Sohl-Dickstein and B Olshausen. A spatial derivation of score matching. Redwood Center Technical Report, 2009.
[22] T Tanaka. Mean-field theory of Boltzmann machine learning. Physical Review Letters E, Jan 1998.
[23] M Welling and G Hinton. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, Jan 2002.
[24] A Yuille. The convergence of contrastive divergences. Department of Statistics, UCLA. Department of Statistics Papers, 2005.
