Inference in Multilayer Networks via Large Deviation Bounds. Michael Kearns and Lawrence Saul. AT&T Labs Research. Shannon Laboratory. 180 Park Avenue A-235
Inference in Multilayer Networks via Large Deviation Bounds

Michael Kearns and Lawrence Saul
AT&T Labs Research, Shannon Laboratory
180 Park Avenue A-235, Florham Park, NJ
{mkearns,lsaul}@research.att.com

Abstract

We study probabilistic inference in large, layered Bayesian networks represented as directed acyclic graphs. We show that the intractability of exact inference in such networks does not preclude their effective use. We give algorithms for approximate probabilistic inference that exploit averaging phenomena occurring at nodes with large numbers of parents. We show that these algorithms compute rigorous lower and upper bounds on marginal probabilities of interest, prove that these bounds become exact in the limit of large networks, and provide rates of convergence.

1 Introduction

The promise of neural computation lies in exploiting the information processing abilities of simple computing elements organized into large networks. Arguably one of the most important types of information processing is the capacity for probabilistic reasoning. The properties of undirected probabilistic models represented as symmetric networks have been studied extensively using methods from statistical mechanics (Hertz et al., 1991). Detailed analyses of these models are possible by exploiting averaging phenomena that occur in the thermodynamic limit of large networks.

In this paper, we analyze the limit of large, multilayer networks for probabilistic models represented as directed acyclic graphs. These models are known as Bayesian networks (Pearl, 1988; Neal, 1992), and they have different probabilistic semantics than symmetric neural networks (such as Hopfield models or Boltzmann machines). We show that the intractability of exact inference in multilayer Bayesian networks
does not preclude their effective use. Our work builds on earlier studies of variational methods (Jordan et al., 1997). We give algorithms for approximate probabilistic inference that exploit averaging phenomena occurring at nodes with N ≫ 1 parents. We show that these algorithms compute rigorous lower and upper bounds on marginal probabilities of interest, prove that these bounds become exact in the limit N → ∞, and provide rates of convergence.

2 Definitions and Preliminaries

A Bayesian network is a directed graphical probabilistic model, in which the nodes represent random variables, and the links represent causal dependencies. The joint distribution of this model is obtained by composing the local conditional probability distributions (or tables), Pr[child | parents], specified at each node in the network. For networks of binary random variables, so-called transfer functions provide a convenient way to parameterize conditional probability tables (CPTs). A transfer function is a mapping f: [−∞, ∞] → [0, 1] that is everywhere differentiable and satisfies f′(x) ≥ 0 for all x (thus, f is nondecreasing). If f′(x) ≤ λ for all x, we say that f has slope λ. Common examples of transfer functions of bounded slope include the sigmoid f(x) = 1/(1 + e^{−x}), the cumulative gaussian f(x) = (1/√π) ∫_{−∞}^{x} dt e^{−t²}, and the noisy-OR f(x) = 1 − e^{−x}. Because the value of a transfer function f is bounded between 0 and 1, it can be interpreted as the conditional probability that a binary random variable takes on a particular value.

One use of transfer functions is to endow multilayer networks of soft-thresholding computing elements with probabilistic semantics. This motivates the following definition:

Definition 1 For a transfer function f, a layered probabilistic f-network has:

- Nodes representing binary variables {X_i^ℓ}, ℓ = 1, …, L and i = 1, …, N. Thus, L is the number of layers, and each layer contains N nodes.
- For every pair of nodes X_j^{ℓ−1} and X_i^ℓ in adjacent layers, a real-valued weight θ_{ij}^{ℓ−1} from X_j^{ℓ−1} to X_i^ℓ.
- For every node X_i^1 in the first layer, a bias p_i.

We will sometimes refer to nodes in layer 1 as inputs, and to nodes in layer L as outputs.
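Definition 1 translates directly into a forward sampler. The sketch below is our own illustration (function names and data layout are not from the paper), using the sigmoid as an example transfer function:

```python
import math
import random

def sigmoid(x):
    """Sigmoid transfer function f(x) = 1/(1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_network(biases, weights, f=sigmoid, rng=random):
    """Draw one joint sample from a layered probabilistic f-network.

    biases:  input probabilities p_i for layer 1.
    weights: L-1 matrices; weights[l][i][j] feeds node j of layer l+1
             into node i of layer l+2.
    Returns a list of L binary layers.
    """
    layer = [1 if rng.random() < p else 0 for p in biases]
    layers = [layer]
    for theta in weights:
        # Weighted sum entering each node, pushed through the transfer function.
        sums = [sum(w * x for w, x in zip(row, layer)) for row in theta]
        layer = [1 if rng.random() < f(s) else 0 for s in sums]
        layers.append(layer)
    return layers
```

Seeding the generator (`random.Random(0)`) makes the draw reproducible.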
A layered probabilistic f-network defines a joint probability distribution over all of the variables {X_i^ℓ} as follows: each input node X_i^1 is independently set to 1 with probability p_i, and to 0 with probability 1 − p_i. Inductively, given binary values X_j^{ℓ−1} = x_j^{ℓ−1} ∈ {0, 1} for all of the nodes in layer ℓ − 1, the node X_i^ℓ is set to 1 with probability f(Σ_j θ_{ij}^{ℓ−1} x_j^{ℓ−1}).

Among other uses, multilayer networks of this form have been studied as hierarchical generative models of sensory data (Hinton et al., 1995). In such applications, the fundamental computational problem (known as inference) is that of estimating the marginal probability of evidence at some number of output nodes, say the first K. (The computation of conditional probabilities, such as diagnostic queries, can be reduced to marginals via Bayes rule.) More precisely, one wishes to estimate Pr[X_1^L = x_1, …, X_K^L = x_K] (where x_i ∈ {0, 1}), a quantity whose exact computation involves an exponential sum over all the possible settings of the uninstantiated nodes in layers 1 through L − 1, and is known to be computationally intractable (Cooper, 1990).
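To make the exponential cost concrete, here is a brute-force sketch of exact inference (our own illustration, usable only for toy sizes): it sums over all 2^{N(L−1)} settings of the unobserved layers.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def exact_output_marginal(biases, weights, evidence, f=sigmoid):
    """Exact Pr[X_1^L = x_1, ..., X_K^L = x_K] by exhaustive enumeration.

    Sums over all 2**(N * (L-1)) settings of the unobserved layers 1..L-1,
    so the running time is exponential in the network size.
    """
    n = len(biases)
    hidden = len(weights)                    # layers 1..L-1 are unobserved
    total = 0.0
    for bits in itertools.product((0, 1), repeat=n * hidden):
        layers = [bits[l * n:(l + 1) * n] for l in range(hidden)]
        prob = 1.0
        for x, p in zip(layers[0], biases):              # input layer
            prob *= p if x else 1.0 - p
        for l in range(1, hidden):                        # hidden layers 2..L-1
            for i, x in enumerate(layers[l]):
                q = f(sum(w * z for w, z in zip(weights[l - 1][i], layers[l - 1])))
                prob *= q if x else 1.0 - q
        for i, x in enumerate(evidence):                  # observed outputs
            q = f(sum(w * z for w, z in zip(weights[-1][i], layers[-1])))
            prob *= q if x else 1.0 - q
        total += prob
    return total
```

A sanity check: for fixed evidence nodes, the marginals over all joint evidence settings must sum to one.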
3 Large Deviation and Union Bounds

One of our main weapons will be the theory of large deviations. As a first illustration of this theory, consider the input nodes {X_j^1} (which are independently set to 0 or 1 according to their biases p_j) and the weighted sum Σ_j θ_{ij}^1 X_j^1 that feeds into the ith node X_i^2 in the second layer. A typical large deviation bound (Kearns & Saul, 1998) states that for all ε > 0, Pr[|Σ_j θ_{ij}^1 (X_j^1 − p_j)| > ε] ≤ 2e^{−2ε²/(NΘ²)}, where Θ is the largest weight in the network. If we make the scaling assumption that each weight θ_{ij}^1 is bounded by τ/N for some constant τ (thus, Θ ≤ τ/N), then we see that the probability of large (order 1) deviations of this weighted sum from its mean decays exponentially with N. (Our methods can also provide results under the weaker assumption that all weights are bounded by O(N^{−a}) for a > 1/2.)

How can we apply this observation to the problem of inference? Suppose we are interested in the marginal probability Pr[X_i^2 = 1]. Then the large deviation bound tells us that with probability at least 1 − δ (where we define δ = 2e^{−2Nε²/τ²}), the weighted sum at node X_i^2 will be within ε of its mean value μ_i = Σ_j θ_{ij}^1 p_j. Thus, with probability at least 1 − δ, we are assured that Pr[X_i^2 = 1] is at least f(μ_i − ε) and at most f(μ_i + ε). Of course, the flip side of the large deviation bound is that with probability at most δ, the weighted sum may fall more than ε away from μ_i. In this case we can make no guarantees on Pr[X_i^2 = 1] aside from the trivial lower and upper bounds of 0 and 1. Combining both eventualities, however, we obtain the overall bounds:

    (1 − δ) f(μ_i − ε) ≤ Pr[X_i^2 = 1] ≤ (1 − δ) f(μ_i + ε) + δ.    (1)

Equation (1) is based on a simple two-point approximation to the distribution over the weighted sum of inputs, Σ_j θ_{ij}^1 X_j^1. This approximation places one point, with weight 1 − δ, at either ε above or below the mean μ_i (depending on whether we are deriving the upper or lower bound); and the other point, with weight δ, at either −∞ or +∞. The value of δ depends on the choice of ε: in particular, as ε becomes smaller, we give more weight to the ±∞ point, with the trade-off governed by the large deviation bound.
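Equation (1) is directly computable. A minimal sketch (our own helper, assuming the sigmoid transfer function and the τ/N weight scaling described above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def two_point_bounds(theta_row, biases, tau, eps, f=sigmoid):
    """Equation (1): lower and upper bounds on Pr[X_i^2 = 1] from the
    two-point approximation, for a single node in the second layer."""
    n = len(biases)
    mu = sum(th * p for th, p in zip(theta_row, biases))   # mean of weighted sum
    # Throw-away probability delta = 2 exp(-2 N eps^2 / tau^2), capped at 1.
    delta = min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / tau ** 2))
    return (1.0 - delta) * f(mu - eps), (1.0 - delta) * f(mu + eps) + delta
```

Since f is nondecreasing, f(μ_i) always lies between these two bounds, which gives a cheap consistency check.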
We regard the weight given to the ±∞ point as a throw-away probability, since with this weight we resort to the trivial bounds of 0 or 1 on the marginal probability Pr[X_i^2 = 1]. Note that the very simple bounds in Equation (1) already exhibit an interesting trade-off, governed by the choice of the parameter ε: namely, as ε becomes smaller, the throw-away probability δ becomes larger, while the terms f(μ_i ± ε) converge to the same value. Since the overall bounds involve products of f(μ_i ± ε) and 1 − δ, the optimal value of ε is the one that balances this competition between probable explanations of the evidence and improbable deviations from the mean. This trade-off is reminiscent of that encountered between energy and entropy in mean-field approximations for symmetric networks (Hertz et al., 1991).

So far we have considered the marginal probability involving a single node in the second layer. We can also compute bounds on the marginal probabilities involving K > 1 nodes in this layer (which without loss of generality we take to be the nodes X_1^2 through X_K^2). This is done by considering the probability that one or more of the weighted sums entering these K nodes in the second layer deviate by more than ε from their means. We can upper bound this probability by Kδ by appealing to the so-called union bound, which simply states that the probability of a union of events is bounded by the sum of their individual probabilities. The union bound allows us to bound marginal probabilities involving multiple variables. For example,
consider the marginal probability Pr[X_1^2 = 1, …, X_K^2 = 1]. Combining the large deviation and union bounds, we find:

    (1 − Kδ) ∏_{i=1}^K f(μ_i − ε) ≤ Pr[X_1^2 = 1, …, X_K^2 = 1] ≤ (1 − Kδ) ∏_{i=1}^K f(μ_i + ε) + Kδ.    (2)

A number of observations are in order here. First, Equation (2) directly leads to efficient algorithms for computing the upper and lower bounds. Second, although for simplicity we have considered ε-deviations of the same size at each node in the second layer, the same methods apply to different choices of ε_i (and therefore δ_i) at each node. Indeed, variations in ε_i can lead to significantly tighter bounds, and thus we exploit the freedom to choose different ε_i in the rest of the paper. This results, for example, in bounds of the form:

    (1 − Σ_{i=1}^K δ_i) ∏_{i=1}^K f(μ_i − ε_i) ≤ Pr[X_1^2 = 1, …, X_K^2 = 1],  where δ_i = 2e^{−2Nε_i²/τ²}.    (3)

The reader is invited to study the small but important differences between this lower bound and the one in Equation (2). Third, the arguments leading to bounds on the marginal probability Pr[X_1^2 = 1, …, X_K^2 = 1] generalize in a straightforward manner to other patterns of evidence besides all 1's. For instance, again just considering the lower bound, we have:

    (1 − Σ_{i=1}^K δ_i) ∏_{i: x_i = 0} [1 − f(μ_i + ε_i)] ∏_{i: x_i = 1} f(μ_i − ε_i) ≤ Pr[X_1^2 = x_1, …, X_K^2 = x_K]    (4)

where x_i ∈ {0, 1} are arbitrary binary values. Thus together the large deviation and union bounds provide the means to compute upper and lower bounds on the marginal probabilities over nodes in the second layer. Further details and consequences of these bounds for the special case of two-layer networks are given in a companion paper (Kearns & Saul, 1998); our interest here, however, is in the more challenging generalization to multilayer networks.

4 Multilayer Networks: Inference via Induction

In extending the ideas of the previous section to multilayer networks, we face the problem that the nodes in the second layer, unlike those in the first, are not independent. But we can still adopt an inductive strategy to derive bounds on marginal probabilities.
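The per-node lower bound of Equation (4) can be sketched as follows (again our own illustrative code; `evidence_lower_bound` is a hypothetical helper name, and the sigmoid is an assumed choice of f):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def evidence_lower_bound(theta, biases, tau, eps_list, evidence, f=sigmoid):
    """Equation (4): lower bound on Pr[X_1^2 = x_1, ..., X_K^2 = x_K],
    with a separate deviation eps_i (hence delta_i) per evidence node."""
    n = len(biases)
    total_delta, prod = 0.0, 1.0
    for row, eps, x in zip(theta, eps_list, evidence):
        mu = sum(th * p for th, p in zip(row, biases))
        # Union bound: throw-away probabilities add across evidence nodes.
        total_delta += 2.0 * math.exp(-2.0 * n * eps ** 2 / tau ** 2)
        prod *= f(mu - eps) if x == 1 else 1.0 - f(mu + eps)
    return max(0.0, (1.0 - total_delta) * prod)
```

When ε_i is chosen too small, the accumulated δ_i exceed 1 and the bound collapses to the trivial value 0, which illustrates the trade-off discussed above.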
The crucial observation is that conditioned on the values of the incoming weighted sums at the nodes in the second layer, the variables {X_i^2} do become independent. More generally, conditioned on these weighted sums all falling "near" their means (an event whose probability we quantified in the last section), the nodes {X_i^2} become "almost" independent. It is exactly this near-independence that we now formalize and exploit inductively to compute bounds for multilayer networks.

The first tool we require is an appropriate generalization of the large deviation bound, which does not rely on precise knowledge of the means of the random variables being summed.

Theorem 1 For all 1 ≤ i ≤ N, let X_i ∈ {0, 1} denote independent binary random variables, and let |θ_i| ≤ τ/N. Suppose that the means are bounded by |E[X_i] − p_i| ≤ Δ_i, where 0 < p_i ± Δ_i < 1. Then for all ε > (τ/N) Σ_{i=1}^N Δ_i:

    Pr[ |Σ_{i=1}^N θ_i (X_i − p_i)| > ε ] ≤ 2 exp[ −(2N/τ²) (ε − (τ/N) Σ_i Δ_i)² ].    (5)
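The right hand side of Equation (5) is cheap to evaluate. A sketch, assuming the reconstruction of the bound given above; with all Δ_i = 0 it reduces to the earlier bound 2e^{−2Nε²/τ²}:

```python
import math

def theorem1_bound(eps, n, tau, deltas):
    """Right hand side of Equation (5): a Hoeffding-style tail bound that
    tolerates uncertainty Delta_i in each mean.  Valid only when
    eps > (tau/N) * sum(Delta_i); otherwise only the trivial bound 1 holds."""
    slack = eps - (tau / n) * sum(deltas)
    if slack <= 0:
        return 1.0
    return min(1.0, 2.0 * math.exp(-(2.0 * n / tau ** 2) * slack ** 2))
```

As expected, uncertainty in the means can only weaken the bound.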
The proof of this result is omitted due to space considerations. Now for the induction, consider the nodes in the ℓth layer of the network. Suppose we are told that for every i, the weighted sum Σ_j θ_{ij}^{ℓ−1} X_j^{ℓ−1} entering into the node X_i^ℓ lies in the interval [μ_i^ℓ − ε_i^ℓ, μ_i^ℓ + ε_i^ℓ], for some choice of the μ_i^ℓ and the ε_i^ℓ. Then the mean of node X_i^ℓ is constrained to lie in the interval [p_i^ℓ − Δ_i^ℓ, p_i^ℓ + Δ_i^ℓ], where

    p_i^ℓ = ½ [ f(μ_i^ℓ − ε_i^ℓ) + f(μ_i^ℓ + ε_i^ℓ) ]    (6)
    Δ_i^ℓ = ½ [ f(μ_i^ℓ + ε_i^ℓ) − f(μ_i^ℓ − ε_i^ℓ) ].    (7)

Here we have simply run the leftmost and rightmost allowed values for the incoming weighted sums through the transfer function, and defined the interval around the mean of unit X_i^ℓ to be centered around p_i^ℓ. Thus we have translated uncertainties on the incoming weighted sums to layer ℓ into conditional uncertainties on the means of the nodes X_i^ℓ in layer ℓ. To complete the cycle, we now translate these into conditional uncertainties on the incoming weighted sums to layer ℓ + 1. In particular, conditioned on the original intervals [μ_i^ℓ − ε_i^ℓ, μ_i^ℓ + ε_i^ℓ], what is the probability that for each i, Σ_j θ_{ij}^ℓ X_j^ℓ lies inside some new interval [μ_i^{ℓ+1} − ε_i^{ℓ+1}, μ_i^{ℓ+1} + ε_i^{ℓ+1}]? In order to make some guarantee on this probability, we set μ_i^{ℓ+1} = Σ_j θ_{ij}^ℓ p_j^ℓ and assume that ε_i^{ℓ+1} > (τ/N) Σ_j Δ_j^ℓ. These conditions suffice to ensure that the new intervals contain the (conditional) expected values of the weighted sums Σ_j θ_{ij}^ℓ X_j^ℓ, and that the new intervals are large enough to encompass the incoming uncertainties. Because these conditions are a minimal requirement for establishing any probabilistic guarantees, we shall say that the [μ_i^ℓ − ε_i^ℓ, μ_i^ℓ + ε_i^ℓ] define a valid set of (μ, ε)-intervals if they meet these conditions for all 1 ≤ i ≤ N.

Given a valid set of (μ, ε)-intervals at the (ℓ + 1)th layer, it follows from Theorem 1 and the union bound that the weighted sums entering nodes in layer ℓ + 1 obey

    Pr[ |Σ_j θ_{ij}^ℓ X_j^ℓ − μ_i^{ℓ+1}| > ε_i^{ℓ+1} for some 1 ≤ i ≤ N ] ≤ Σ_{i=1}^N δ_i^{ℓ+1},    (8)

where

    δ_i^{ℓ+1} = 2 exp[ −(2N/τ²) (ε_i^{ℓ+1} − (τ/N) Σ_j Δ_j^ℓ)² ].    (9)

In what follows, we shall frequently make use of the fact that the weighted sums Σ_j θ_{ij}^ℓ X_j^ℓ are bounded by intervals [μ_i^{ℓ+1} − ε_i^{ℓ+1}, μ_i^{ℓ+1} + ε_i^{ℓ+1}]. This motivates the following definitions.
Definition 2 Given a valid set of (μ, ε)-intervals and binary values {X_j^ℓ = x_j^ℓ} for the nodes in the ℓth layer, we say that the (ℓ + 1)st layer of the network satisfies its ε-intervals if |Σ_j θ_{ij}^ℓ x_j^ℓ − μ_i^{ℓ+1}| < ε_i^{ℓ+1} for all 1 ≤ i ≤ N. Otherwise, we say that the (ℓ + 1)st layer violates its ε-intervals.

Suppose that we are given a valid set of (μ, ε)-intervals and that we sample from the joint distribution defined by the probabilistic f-network. The right hand side of Equation (8) provides an upper bound on the conditional probability that the (ℓ + 1)st layer violates its ε-intervals, given that the ℓth layer did not. This upper bound may be vacuous (that is, larger than 1), so let us denote by δ_{ℓ+1} whichever is smaller, the right hand side of Equation (8) or 1; in other words, δ_{ℓ+1} = min{ Σ_{i=1}^N δ_i^{ℓ+1}, 1 }. Since at the ℓth layer, the probability of violating the ε-intervals is at most δ_ℓ, we
are guaranteed that with probability at least ∏_{ℓ>1} [1 − δ_ℓ], all the layers satisfy their ε-intervals. Conversely, we are guaranteed that the probability that any layer violates its ε-intervals is at most 1 − ∏_{ℓ>1} [1 − δ_ℓ]. Treating this as a throw-away probability, we can now compute upper and lower bounds on marginal probabilities involving nodes at the Lth layer exactly as in the case of nodes at the second layer. This yields the following theorem.

Theorem 2 For any subset {X_1^L, …, X_K^L} of the outputs of a probabilistic f-network, for any setting x_1, …, x_K, and for any valid set of (μ, ε)-intervals, the marginal probability of partial evidence in the output layer obeys:

    ∏_{ℓ>1} [1 − δ_ℓ] ∏_{i: x_i=1} f(μ_i^L − ε_i^L) ∏_{i: x_i=0} [1 − f(μ_i^L + ε_i^L)]    (10)
        ≤ Pr[X_1^L = x_1, …, X_K^L = x_K]
        ≤ ∏_{ℓ>1} [1 − δ_ℓ] ∏_{i: x_i=1} f(μ_i^L + ε_i^L) ∏_{i: x_i=0} [1 − f(μ_i^L − ε_i^L)] + (1 − ∏_{ℓ>1} [1 − δ_ℓ]).    (11)

Theorem 2 generalizes our earlier results for marginal probabilities over nodes in the second layer; for example, compare Equation (10) to Equation (4). Again, the upper and lower bounds can be efficiently computed for all common transfer functions.

5 Rates of Convergence

To demonstrate the power of Theorem 2, we consider how the gap (or additive difference) between these upper and lower bounds on Pr[X_1^L = x_1, …, X_K^L = x_K] behaves for some crude (but informed) choices of the {ε_i^ℓ}. Our goal is to derive the rate at which these upper and lower bounds converge to the same value as we examine larger and larger networks. Suppose we choose the ε-intervals inductively by defining Δ_i^1 = 0 and setting

    ε_i^{ℓ+1} = (τ/N) Σ_j Δ_j^ℓ + τ √( ν ln N / (2N) )    (12)

for some ν > 1. From Equations (8) and (9), this choice gives 2N^{1−ν} as an upper bound on the probability that the (ℓ + 1)th layer violates its ε-intervals.
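The inductive computation of Equations (6)-(9), combined with Theorem 2 and the ε schedule just given, can be sketched end to end. This is our own illustration under simplifying assumptions (a uniform per-layer ε and the sigmoid transfer function), not the paper's optimized algorithm:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def theorem2_bounds(biases, weights, evidence, tau, nu=2.0, f=sigmoid):
    """Upper and lower bounds of Theorem 2, propagating (mu, eps)-intervals
    layer by layer with eps = (tau/N) sum(Delta) + tau sqrt(nu ln N / (2N))."""
    n = len(biases)
    pad = tau * math.sqrt(nu * math.log(n) / (2.0 * n))
    p, Delta = list(biases), [0.0] * n        # input means are known exactly
    keep = 1.0                                # product over layers of (1 - delta_l)
    for theta in weights:                     # one pass per layer 2..L
        mu = [sum(w * pj for w, pj in zip(row, p)) for row in theta]
        eps = (tau / n) * sum(Delta) + pad
        # Eqs (8)-(9): eps - (tau/N) sum(Delta) = pad, union-bounded over N nodes.
        delta_l = min(1.0, 2.0 * n * math.exp(-(2.0 * n / tau ** 2) * pad ** 2))
        keep *= 1.0 - delta_l
        # Eqs (6)-(7): intervals on the means of this layer's nodes.
        p = [0.5 * (f(m - eps) + f(m + eps)) for m in mu]
        Delta = [0.5 * (f(m + eps) - f(m - eps)) for m in mu]
    lower = upper = keep                      # mu, eps now describe the output layer
    for i, x in enumerate(evidence):
        if x == 1:
            lower *= f(mu[i] - eps)
            upper *= f(mu[i] + eps)
        else:
            lower *= 1.0 - f(mu[i] + eps)
            upper *= 1.0 - f(mu[i] - eps)
    return max(0.0, lower), min(1.0, upper + (1.0 - keep))
```

On a symmetric three-layer network (where the true output marginal is 1/2 by symmetry), the bounds bracket 1/2 with a nontrivial gap.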
Moreover, denoting the gap between the upper and lower bounds in Theorem 2 by G, it can be shown that:

    G ≤ 2λτL √( ν ln N / (2N) ) Σ_{i=1}^K [ ∏_{j≠i: x_j=1} f(μ_j^L + ε_j^L) ∏_{j≠i: x_j=0} [1 − f(μ_j^L − ε_j^L)] ] + 2L/N^{ν−1}.    (13)

Let us briefly recall the definitions of the parameters on the right hand side of this equation: λ is the maximal slope of the transfer function f, N is the number of nodes in each layer, K is the number of nodes with evidence, τ is N times the largest weight in the network, L is the number of layers, and ν > 1 is a parameter at our disposal.

The first term of this bound essentially has a 1/√N dependence on N, but is multiplied by a damping factor that we might typically expect to decay exponentially with the number K of outputs examined. To see this, simply notice that each of the factors f(μ_j^L + ε_j^L) and [1 − f(μ_j^L − ε_j^L)] is bounded by 1; furthermore,
since all the means are bounded, if N is large compared to τ then the ε_i^L are small, and each of these factors is in fact bounded by some value β < 1. Thus the first term in Equation (13) is bounded by a constant times K β^{K−1} √( ln N / N ). Since it is natural to expect the marginal probability of interest itself to decrease exponentially with K, this is desirable and natural behavior. Of course, in the case of large K, the behavior of the resulting overall bound can be dominated by the second term 2L/N^{ν−1} of Equation (13). In such situations, however, we can consider larger values of ν, possibly even of order K; indeed, for sufficiently large ν, the first term (which scales like √ν) must necessarily overtake the second one. Thus there is a clear trade-off between the two terms, as well as an optimal value of ν that sets them to be (roughly) the same magnitude. Generally speaking, for fixed K and large N, we observe that the difference between our upper and lower bounds on Pr[X_1^L = x_1, …, X_K^L = x_K] vanishes as O(√( ln N / N )).

6 An Algorithm for Fixed Multilayer Networks

We conclude by noting that the specific choices made for the parameters ε_i^ℓ in Section 5 to derive rates of convergence may be far from the optimal choices for a fixed network of interest. However, Theorem 2 directly suggests a natural algorithm for approximate probabilistic inference. In particular, regarding the upper and lower bounds on Pr[X_1^L = x_1, …, X_K^L = x_K] as functions of {ε_i^ℓ}, we can optimize these bounds by standard numerical methods. For the upper bound, we may perform gradient descent in the {ε_i^ℓ} to find a local minimum, while for the lower bound, we may perform gradient ascent to find a local maximum. The components of these gradients in both cases are easily computable for all the commonly studied transfer functions. Moreover, the constraint of maintaining valid (μ, ε)-intervals can be enforced by maintaining a floor on the ε-intervals in one layer in terms of those at the previous one. The practical application of this algorithm to interesting Bayesian networks will be studied in future work.

References

Cooper, G. (1990).
Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42.

Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268:1158-1161.

Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1997). An introduction to variational methods for graphical models. In M. Jordan, ed., Learning in Graphical Models. Kluwer Academic.

Kearns, M., & Saul, L. (1998). Large deviation methods for approximate probabilistic inference. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence 56:71-113.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationMore metrics on cartesian products
More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of
More informationLecture 20: November 7
0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:
More informationChapter 9: Statistical Inference and the Relationship between Two Variables
Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,
More informationLecture 4: November 17, Part 1 Single Buffer Management
Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationSociété de Calcul Mathématique SA
Socété de Calcul Mathématque SA Outls d'ade à la décson Tools for decson help Probablstc Studes: Normalzng the Hstograms Bernard Beauzamy December, 202 I. General constructon of the hstogram Any probablstc
More informationSnce h( q^; q) = hq ~ and h( p^ ; p) = hp, one can wrte ~ h hq hp = hq ~hp ~ (7) the uncertanty relaton for an arbtrary state. The states that mnmze t
8.5: Many-body phenomena n condensed matter and atomc physcs Last moded: September, 003 Lecture. Squeezed States In ths lecture we shall contnue the dscusson of coherent states, focusng on ther propertes
More informationU-Pb Geochronology Practical: Background
U-Pb Geochronology Practcal: Background Basc Concepts: accuracy: measure of the dfference between an expermental measurement and the true value precson: measure of the reproducblty of the expermental result
More information3.1 ML and Empirical Distribution
67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum
More informationConjugacy and the Exponential Family
CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the
More information4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA
4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected
More informationHidden Markov Models & The Multivariate Gaussian (10/26/04)
CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models
More informationSupplementary Notes for Chapter 9 Mixture Thermodynamics
Supplementary Notes for Chapter 9 Mxture Thermodynamcs Key ponts Nne major topcs of Chapter 9 are revewed below: 1. Notaton and operatonal equatons for mxtures 2. PVTN EOSs for mxtures 3. General effects
More informationPsychology 282 Lecture #24 Outline Regression Diagnostics: Outliers
Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.
More informationMIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU
Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern
More informationEM and Structure Learning
EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder
More informationExpected Value and Variance
MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationThe exam is closed book, closed notes except your one-page cheat sheet.
CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces
More informationSolution Thermodynamics
Soluton hermodynamcs usng Wagner Notaton by Stanley. Howard Department of aterals and etallurgcal Engneerng South Dakota School of nes and echnology Rapd Cty, SD 57701 January 7, 001 Soluton hermodynamcs
More informationFoundations of Arithmetic
Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an
More informationComparison of Regression Lines
STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models
More informationFREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,
FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then
More informationTHE SUMMATION NOTATION Ʃ
Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the
More informationWhy BP Works STAT 232B
Why BP Works STAT 232B Free Energes Helmholz & Gbbs Free Energes 1 Dstance between Probablstc Models - K-L dvergence b{ KL b{ p{ = b{ ln { } p{ Here, p{ s the eact ont prob. b{ s the appromaton, called
More informationLecture 4: September 12
36-755: Advanced Statstcal Theory Fall 016 Lecture 4: September 1 Lecturer: Alessandro Rnaldo Scrbe: Xao Hu Ta Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer: These notes have not been
More informationComplete subgraphs in multipartite graphs
Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G
More informationLimited Dependent Variables
Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages
More information1 Convex Optimization
Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,
More informationEconomics 101. Lecture 4 - Equilibrium and Efficiency
Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of
More informationLossy Compression. Compromise accuracy of reconstruction for increased compression.
Lossy Compresson Compromse accuracy of reconstructon for ncreased compresson. The reconstructon s usually vsbly ndstngushable from the orgnal mage. Typcally, one can get up to 0:1 compresson wth almost
More information2 STATISTICALLY OPTIMAL TRAINING DATA 2.1 A CRITERION OF OPTIMALITY We revew the crteron of statstcally optmal tranng data (Fukumzu et al., 1994). We
Advances n Neural Informaton Processng Systems 8 Actve Learnng n Multlayer Perceptrons Kenj Fukumzu Informaton and Communcaton R&D Center, Rcoh Co., Ltd. 3-2-3, Shn-yokohama, Yokohama, 222 Japan E-mal:
More informationprinceton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg
prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there
More informationTAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES
TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES SVANTE JANSON Abstract. We gve explct bounds for the tal probabltes for sums of ndependent geometrc or exponental varables, possbly wth dfferent
More informationThe optimal delay of the second test is therefore approximately 210 hours earlier than =2.
THE IEC 61508 FORMULAS 223 The optmal delay of the second test s therefore approxmately 210 hours earler than =2. 8.4 The IEC 61508 Formulas IEC 61508-6 provdes approxmaton formulas for the PF for smple
More informationLearning Theory: Lecture Notes
Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be
More informationSome modelling aspects for the Matlab implementation of MMA
Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton
More informationSupporting Information
Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to
More informationDifference Equations
Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More informationUncertainty and auto-correlation in. Measurement
Uncertanty and auto-correlaton n arxv:1707.03276v2 [physcs.data-an] 30 Dec 2017 Measurement Markus Schebl Federal Offce of Metrology and Surveyng (BEV), 1160 Venna, Austra E-mal: markus.schebl@bev.gv.at
More informationChapter 2 - The Simple Linear Regression Model S =0. e i is a random error. S β2 β. This is a minimization problem. Solution is a calculus exercise.
Chapter - The Smple Lnear Regresson Model The lnear regresson equaton s: where y + = β + β e for =,..., y and are observable varables e s a random error How can an estmaton rule be constructed for the
More informationSolving Nonlinear Differential Equations by a Neural Network Method
Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,
More information