Second order approximations for probability models
Hilbert Kappen
Department of Biophysics, Nijmegen University, Nijmegen, the Netherlands
bert@mbfys.kun.nl

Wim Wiegerinck
Department of Biophysics, Nijmegen University, Nijmegen, the Netherlands
wimw@mbfys.kun.nl

Abstract

In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absence of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.

1 Introduction

Recently, a number of authors have proposed deterministic methods for approximate inference in large graphical models. The simplest approach gives a lower bound on the probability of a subset of variables using Jensen's inequality (Saul et al., 1996). The method involves the minimisation of the KL divergence between the target probability distribution and some simple variational distribution. The method can be applied to a large class of probability models, such as sigmoid belief networks, DAGs and Boltzmann Machines (BMs). For Boltzmann-Gibbs distributions, it is possible to derive the lower bound as the first term in a Taylor series expansion of the free energy around a factorised model. The free energy is given by F = -log Z, where Z is the normalisation constant of the Boltzmann-Gibbs distribution p(x) = exp(-E(x))/Z.
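The quantities in this paragraph are easy to make concrete on a toy model. The following Python sketch (not from the paper; the weights, biases and function names are illustrative assumptions) enumerates a three-unit Boltzmann-Gibbs distribution, computes the free energy F = -log Z exactly, and evaluates the KL divergence between a factorized variational distribution and the target, which is the quantity the first order method minimises:

```python
import itertools
import numpy as np

# Toy Boltzmann-Gibbs distribution on 3 binary (0/1) units; the weights and
# biases are arbitrary illustrative values, not taken from the paper.
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.8],
              [-0.3, 0.8, 0.0]])
theta = np.array([0.1, -0.2, 0.3])

states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
E = -0.5 * np.einsum('si,ij,sj->s', states, W, states) - states @ theta
Z = np.exp(-E).sum()          # partition function (tractable at this size)
F = -np.log(Z)                # free energy
p = np.exp(-E) / Z            # exact Boltzmann-Gibbs distribution

def kl_q_p(m):
    """KL(q || p) for a factorized q with means m.

    By Jensen's inequality, -F - KL(q || p) is a lower bound on log Z
    for any factorized q; minimising the KL tightens the bound."""
    q = np.prod(states * m + (1 - states) * (1 - m), axis=1)
    return float(np.sum(q * (np.log(q) - np.log(p))))
```

Minimising `kl_q_p` over the means `m` recovers the standard first order mean field solution; the paper's point is that for Boltzmann-Gibbs distributions this first order approximation can be sharpened by a second order term.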
This Taylor series can be continued, and the second order term is known as the TAP correction (Plefka, 1982; Kappen and Rodríguez, 1998). The second order term significantly improves the quality of the approximation, but is no longer a bound. For probability distributions that are not Boltzmann-Gibbs distributions, it is not obvious how to obtain the second order approximation. However, there is an alternative way to compute the higher order corrections, based on an information theoretic argument. Recently, this argument was applied to stochastic neural networks with asymmetric connectivity (Kappen and Spanjers, 1999). Here, we apply this idea to directed graphical models.
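For Boltzmann machines the TAP correction mentioned here has a well-known concrete form: an Onsager reaction term added to the mean field equations. As an illustration (not from this paper; the weights are arbitrary assumptions, and the damping factor is a standard stabilisation device rather than part of the equations), a damped fixed point iteration of the TAP equations for a small Boltzmann machine with ±1 units:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.normal(0.0, 0.2, (n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)       # symmetric weights, no self-coupling
theta = rng.normal(0.0, 0.1, n)

m = np.zeros(n)                # magnetisations <s_i>, with s_i in {-1, +1}
for _ in range(500):
    # naive mean field term plus the Onsager reaction (TAP) correction
    field = theta + W @ m - m * (W**2 @ (1.0 - m**2))
    m = 0.5 * m + 0.5 * np.tanh(field)   # damped fixed point update
```

Dropping the reaction term recovers the first order mean field equations m_i = tanh(theta_i + sum_j w_ij m_j); the extra term is the second order correction that this paper generalises beyond Boltzmann-Gibbs distributions.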
2 The method

Let x be an n-dimensional vector, with x_i taking on discrete values, and let p(x) be a directed graphical model on x. We will assume that p(x) can be written as a product of potentials in the following way:

    p(x) = \prod_\alpha \psi_\alpha(x_\alpha)    (1)

Here, \psi_\alpha(x_\alpha) denotes the conditional probability table of a variable given the values of its parents, and x_\alpha denotes the subset of variables that appear in potential \alpha. Potentials can be overlapping, x_\alpha \cap x_\beta \neq \emptyset, and \bigcup_\alpha x_\alpha = x.

We wish to compute the marginal probability that x_i has some specific value in the presence of some evidence. We therefore write x = (x^e, x^h), where x^e denotes the subset of variables that constitute the evidence, and x^h denotes the remainder of the variables. The marginal is given as

    p(x_i^h | x^e) = p(x_i^h, x^e) / p(x^e)    (2)

Both numerator and denominator contain sums over hidden states. These sums scale exponentially with the size of the problem, and therefore the computation of marginals is intractable. We propose to approximate this problem by using a mean field approach. Consider a factorized distribution on the hidden variables:

    q(x^h) = \prod_i q_i(x_i^h)    (3)

We wish to find the factorized distribution q that best approximates p(x^h | x^e). Consider as a distance measure the KL divergence

    KL = \sum_{x^h} p(x^h | x^e) \log [ p(x^h | x^e) / q(x^h) ]    (4)

It is easy to see that the q that minimizes KL satisfies

    q_i(x_i^h) = p(x_i^h | x^e)    (5)

We now think of the manifold of all probability distributions of the form Eq. 1, spanned by the coordinates \psi_\alpha(x_\alpha). For each \alpha, \psi_\alpha is a table of numbers, indexed by x_\alpha. This manifold contains a submanifold of factorized probability distributions, in which the potentials factorize: \tilde\psi_\alpha(x_\alpha) = \prod_{i \in \alpha} \tilde\psi_{\alpha i}(x_i). On this submanifold, p(x^h | x^e) reduces to a factorized distribution of the form q(x^h).

Assume now that p(x^h | x^e) is somehow close to the factorized submanifold. The difference between log p(x_i^h | x^e) and log q_i(x_i^h) is then small, and we can expand this small difference in terms of the changes in the parameters, \Delta_\alpha(x_\alpha) = \log\psi_\alpha(x_\alpha) - \log\tilde\psi_\alpha(x_\alpha):

    \log p(x_i^h | x^e) - \log q_i(x_i^h) =
        \sum_\alpha \sum_{x_\alpha} \frac{\partial (\log p(x_i^h|x^e) - \log q_i(x_i^h))}{\partial \log\psi_\alpha(x_\alpha)} \Delta_\alpha(x_\alpha)
        + \frac{1}{2} \sum_{\alpha\beta} \sum_{x_\alpha, x_\beta} \frac{\partial^2 (\log p(x_i^h|x^e) - \log q_i(x_i^h))}{\partial \log\psi_\alpha(x_\alpha) \partial \log\psi_\beta(x_\beta)} \Delta_\alpha(x_\alpha) \Delta_\beta(x_\beta)
        + higher order terms    (6)
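The property in Eq. 5, that the factorized distribution minimising this KL divergence has the exact marginals as its factors, can be checked numerically on a toy joint distribution. A minimal sketch (illustrative; an arbitrary random table stands in for p(x^h | x^e)):

```python
import itertools
import numpy as np

# Hypothetical joint distribution p over 3 binary variables.
rng = np.random.default_rng(0)
p = rng.random(8)
p /= p.sum()
states = np.array(list(itertools.product([0, 1], repeat=3)))

# Marginals p(x_i = 1): the claimed minimizer of KL(p || q) over factorized q.
m_star = np.array([p[states[:, i] == 1].sum() for i in range(3)])

def kl_p_q(m):
    """KL(p || q) for a factorized q with means m (the measure of Eq. 4)."""
    q = np.prod(states * m + (1 - states) * (1 - m), axis=1)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Perturbing any mean away from the exact marginal never decreases the KL.
eps = 0.01
for i in range(3):
    for sign in (-1, 1):
        m = m_star.copy()
        m[i] += sign * eps
        assert kl_p_q(m) >= kl_p_q(m_star)
```

This is a brute-force check of the statement, not the paper's algorithm: the paper's contribution is precisely how to approximate these marginals when enumeration is intractable.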
The differentials are evaluated in the factorized distribution. The left-hand side of Eq. 6 is zero because of Eq. 5, and we solve for q. This factorized distribution gives the desired marginals up to the order of the expansion. It is straightforward to compute the derivatives; the first derivatives are

    \partial \log p(x_i^h, x^e) / \partial \log\psi_\alpha(x_\alpha) = p(x_\alpha | x_i^h, x^e),
    \partial \log p(x^e) / \partial \log\psi_\alpha(x_\alpha) = p(x_\alpha | x^e),    (7)

and the second derivatives are the corresponding covariances. We introduce the notation <.>_i and <.> for the expectation values with respect to the factorized distributions q(x^h | x_i^h) and q(x^h), respectively, and we define \Delta_\alpha = \log\psi_\alpha - \log\tilde\psi_\alpha. We obtain

    0 = \sum_\alpha ( <\Delta_\alpha>_i - <\Delta_\alpha> )
        + \frac{1}{2} \sum_{\alpha\beta} ( <\Delta_\alpha \Delta_\beta>_i - <\Delta_\alpha>_i <\Delta_\beta>_i - <\Delta_\alpha \Delta_\beta> + <\Delta_\alpha> <\Delta_\beta> )
        + higher order terms    (8)

To first order, setting the first order part of Eq. 8 equal to zero, we obtain

    \log q_i(x_i^h) = \sum_\alpha <\log\psi_\alpha>_i + constant    (9)

where we have absorbed all terms independent of x_i into a constant. Thus, we find the solution

    q_i(x_i^h) \propto \exp( \sum_\alpha <\log\psi_\alpha>_i )    (10)

in which the constants follow from normalisation. The first order term is equivalent to the standard mean field equations, obtained from Jensen's inequality. The correction with second order terms is obtained in the same way, again dropping terms independent of x_i:

    q_i(x_i^h) \propto \exp( \sum_\alpha <\log\psi_\alpha>_i + \frac{1}{2} \sum_{\alpha\beta} ( <\Delta_\alpha \Delta_\beta>_i - <\Delta_\alpha>_i <\Delta_\beta>_i ) )    (11)

where, again, the constants follow from normalisation. These equations, which form the main result of this paper, are a generalization of the mean field equations with TAP corrections to directed graphical models. Both the left- and right-hand sides of Eqs. 10 and 11 depend on the unknown probability distribution q, and the equations can be solved by fixed point iteration.

3 Complexity and single-overlap graphs

The complexity of the first order equations, Eq. 10, is exponential in the number of variables in the potentials of p: if the maximal clique size is c, then for each node i we need of the order of n_i exp(c) computations, where n_i is the number of cliques that contain node i. The second order term scales worse, since one must compute averages over the union of two overlapping cliques and because of the double sum over cliques. However, things are not so bad. First
Figure 1: An example of a single-overlap graph. Left: the chest clinic model (ASIA) (Lauritzen and Spiegelhalter, 1988), with nodes "visit to Asia?", "Smoking?", "Tuberculosis?", "Lung cancer?", "Bronchitis?", "Either tub. or lung cancer?", "positive X-ray?" and "Dyspnoea?". Right: nodes within one potential are grouped together, showing that potentials share at most one node.

of all, notice that the sum over cliques alpha and beta can be restricted to overlapping cliques (x_\alpha \cap x_\beta \neq \emptyset) and that node i must be in either alpha or beta or both. Denote by n' the maximal number of cliques that have at least one variable in common with a given clique. Then the sum over alpha and beta contains not more than of the order of n_i n' terms. Each term is an average over the union of two cliques, which can be in the worst case of size 2c - 1 (when only one variable is shared). However, since <.>_i means expectation with respect to q conditioned on x_i^h, we can precompute the required pair averages for all pairs of overlapping cliques and for all states of x_i^h. Therefore, the worst case complexity of the second order term is less than n_i n' exp(2c - 1) per node.

Thus, we see that the second order method has the same exponential complexity as the first order method, but with a different polynomial prefactor. Therefore, the first or second order method can be applied to directed graphical models as long as the number of parents is reasonably small.

The fact that the second order term has a worse complexity than the first order term is in contrast to Boltzmann machines, in which the TAP approximation has the same complexity as the standard mean field approximation. This phenomenon also occurs for a special class of DAGs, which we call single-overlap graphs. These are graphs in which the potentials share at most one node. Figure 1 shows an example of a single-overlap graph. For single-overlap graphs, we can use the first order result, Eq. 9, to simplify the second order correction. The derivation is rather tedious and we just present the result (Eq. 12, whose explicit form is garbled in this transcription), which has a complexity of the same order as the first order method. For probability distributions with many small potentials that share nodes with many other
potentials, Eq. 12 is more efficient than Eq. 11. For instance, for Boltzmann machines each potential contains only two nodes and overlaps with many others. In this case, Eq. 12 is identical to the TAP equations (Thouless et al., 1977).

4 Sigmoid belief networks

In this section, we consider sigmoid belief networks as an interesting class of directed graphical models. The reason is that one can expand in terms of the couplings instead of the potentials, which is more efficient. The sigmoid belief network is defined as

    p(s) = \prod_i \sigma(h_i)^{s_i} (1 - \sigma(h_i))^{1 - s_i},  s_i \in \{0, 1\}    (13)
Figure 2: Interpretation of the different interaction terms appearing in Eq. 16. The open and shaded nodes are hidden and evidence nodes, respectively (except in (a), where the parent can be any node). Solid arrows indicate the graphical structure of the network. Dashed arrows indicate interaction terms that appear in Eq. 16.

where \sigma(x) = 1 / (1 + exp(-x)), and h_i is the local field: h_i = \sum_{j \in \pi_i} w_{ij} s_j + \theta_i. We separate the variables into evidence variables s^e and hidden variables s^h: s = (s^h, s^e). When the couplings from hidden nodes to either hidden or evidence nodes are zero (w_{ij} = 0 for j hidden), the probability distributions p(s^h | s^e) and p(s^e) reduce to factorized form:

    p(s^h | s^e) = \prod_{i \in h} \sigma(\bar h_i)^{s_i} (1 - \sigma(\bar h_i))^{1 - s_i}    (14)

    p(s^e) = \prod_{i \in e} \sigma(\bar h_i)^{s_i} (1 - \sigma(\bar h_i))^{1 - s_i}    (15)

where \bar h_i = \sum_{j \in e \cap \pi_i} w_{ij} s_j + \theta_i depends on the evidence only. We expand to second order around this tractable distribution and obtain second order equations for the means m_i of the hidden nodes (Eq. 16, whose explicit form is garbled in this transcription; its terms are interpreted below).

The different terms that appear in this equation can be easily interpreted. The first term describes the lowest order forward influence on node i from its parents; parents can be either evidence or hidden nodes (fig. 2a). The second term is the bias. The third term describes, to lowest order, the effect of Bayes' rule: it adjusts m_i such that the observed evidence on its children becomes most probable (fig. 2b). Note that this term is absent when the evidence is explained by the evidence nodes themselves. The fourth and fifth terms are the quadratic contributions to the first and third terms, respectively. The sixth term describes explaining away: the effect of a hidden node j on node i when both have a common observed child (fig. 2c). The last term describes the effect on node i when its grandchild is observed (fig. 2d). Note that these equations are different from Eq. 10: when one applies Eq. 10 to sigmoid belief networks, one requires additional approximations to compute the expectations of \log\sigma(h_i) (Saul et al., 1996).
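The generative semantics of Eqs. 13-15 can be made concrete by ancestral sampling. A minimal sketch, using the 0/1 convention (the weight values and helper names are illustrative assumptions, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(W, theta, rng):
    """One ancestral sample from a sigmoid belief network.

    W is strictly lower triangular (feed-forward: the parents of node i are
    among the nodes j < i), theta holds the biases; states are 0/1 with
    p(s_i = 1 | parents) = sigmoid(sum_j w_ij s_j + theta_i), as in Eq. 13.
    """
    n = len(theta)
    s = np.zeros(n)
    for i in range(n):
        s[i] = float(rng.random() < sigmoid(W[i, :i] @ s[:i] + theta[i]))
    return s

rng = np.random.default_rng(0)
W = np.tril(rng.normal(0.0, 0.5, (4, 4)), k=-1)   # illustrative weights
samples = np.array([sample_sbn(W, np.zeros(4), rng) for _ in range(200)])
```

Clamping a subset of the sampled units yields the evidence configuration s^e whose posterior marginals the second order expansion approximates.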
Figure 3: Second order approximation for a fully connected sigmoid belief network of n nodes. a) Some nodes are hidden (white) and the remaining nodes are clamped (grey); b) CPU time (sec) for exact inference (dashed) and the second order approximation (solid) versus n; c) RMS of the exact marginals of the hidden nodes (solid) and RMS error of the second order approximation (dashed) versus the coupling strength.

Since only feed-forward connections are present, one can order the nodes such that w_{ij} = 0 for j >= i. Then the first order mean field equations can be solved in one single sweep, starting with node 1. The full second order equations can be solved by iteration, starting from the first order solution.

5 Numerical results

We illustrate the theory with two toy problems. The first one is inference in Lauritzen's chest clinic model (ASIA), defined on 8 binary variables (see figure 1, and (Lauritzen and Spiegelhalter, 1988) for more details about the model). We computed exact marginals, and approximate marginals using the approximating methods up to first order (Eq. 10) and second order (Eq. 11), respectively. The approximate marginals are determined by sequential iteration of (10) and (11), starting from the uniform distribution for all variables. The maximal error in the marginals using the first and second order method is 0.213 and 0.061, respectively. We verified that the single-overlap expression, Eq. 12, gave similar results.

In fig. 3, we assess the accuracy and CPU time of the second order approximation, Eq. 16, for sigmoid belief networks. We generated random, fully connected sigmoid belief networks with weights and thresholds drawn from a normal distribution with mean zero. We observe in fig. 3b that the computation time is very fast: for the largest networks, we obtained convergence in 3.7 seconds on a Pentium 300 MHz processor. The accuracy of the method depends on the size of the weights and is computed for a fixed network size (fig. 3c). In (Kappen and Wiegerinck, 2001), we compare this approach to Saul's variational approach (Saul et al., 1996) and show that our approach is much faster and slightly more accurate.
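The single-sweep property and the exact ground truth used in such comparisons can both be sketched in a few lines. Below, a lowest order forward sweep (treating the sigmoid of the mean field as the mean, which is only the first order starting point, not the full second order update of Eq. 16) is scored against brute-force marginals; the network size and weight scale are illustrative assumptions:

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_order_sweep(W, theta):
    """Single forward sweep for the means m_i = q(s_i = 1).

    W is strictly lower triangular (feed-forward ordering), so each mean
    needs only the means of earlier nodes: one topological pass suffices.
    """
    n = len(theta)
    m = np.zeros(n)
    for i in range(n):
        m[i] = sigmoid(W[i, :i] @ m[:i] + theta[i])
    return m

def exact_marginals(W, theta):
    """Exact p(s_i = 1) by enumerating all 2^n joint states (small n only)."""
    n = len(theta)
    marg = np.zeros(n)
    for bits in itertools.product([0.0, 1.0], repeat=n):
        s = np.array(bits)
        p_on = sigmoid(W @ s + theta)      # lower triangular W: parents j < i
        marg += np.prod(np.where(s == 1.0, p_on, 1.0 - p_on)) * s
    return marg

rng = np.random.default_rng(0)
n = 8
W = np.tril(rng.normal(0.0, 0.5 / np.sqrt(n), (n, n)), k=-1)
theta = rng.normal(0.0, 0.5, n)
rms = np.sqrt(np.mean((first_order_sweep(W, theta) - exact_marginals(W, theta)) ** 2))
```

The exponential cost of `exact_marginals` is exactly what the second order approximation avoids; the RMS error here plays the role of the dashed curve in fig. 3c.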
6 Discussion

In this paper, we computed a second order mean field approximation for directed graphical models. We show that the second order approximation gives a significant improvement over the first order result. The method does not explicitly use the fact that the graph is directed; therefore, the result is equally valid for Markov graphs. The complexity of both the first and the second order approximation is exponential in the number of variables in the largest potential, with the second order method carrying a worse prefactor. For single-overlap graphs, one can rewrite the second order equations such that the computational complexity reduces to that of the first order method. Boltzmann machines and the ASIA network are examples of single-overlap graphs. For large potentials, additional approximations are required, as was proposed by (Saul et al., 1996) for the first order mean field equations. It is evident that such additional approximations are then also required for the second order mean field equations.

It has been reported (Barber and Wiegerinck, 1999; Wiegerinck and Kappen, 1999) that similar numerical improvements can be obtained by using a very different approach, which is to use an approximating distribution that is not factorized, but still tractable. A promising way to proceed is therefore to combine both approaches and to do a second order expansion around a manifold of non-factorized yet tractable distributions. In this approach, the sufficient statistics of the tractable structure are expanded, rather than the marginal probabilities.

Acknowledgments

This research was supported in part by the Dutch Technology Foundation (STW).

References

Barber, D. and Wiegerinck, W. (1999). Tractable variational structures for approximating graphical models. In Kearns, M., Solla, S., and Cohn, D., editors, Advances in Neural Information Processing Systems, volume 11. MIT Press.

Kappen, H. J. and Rodríguez, F. (1998). Efficient learning in Boltzmann Machines using linear response theory. Neural Computation, 10.

Kappen, H. J. and Spanjers, J. J. (1999). Mean field theory for asymmetric neural networks. Physical Review E, 61.

Kappen, H. J. and Wiegerinck, W. (2001). Mean field theory for graphical models. In Saad, D. and Opper, M., editors, Advanced Mean Field Theory. MIT Press.

Lauritzen, S. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50.

Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A, 15.

Saul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76.

Thouless, D., Anderson, P., and Palmer, R. (1977). Solution of 'Solvable Model of a Spin Glass'. Philosophical Magazine, 35.

Wiegerinck, W. and Kappen, H. J. (1999). Approximations of Bayesian networks through KL minimisation. New Generation Computing, 18.
Mean Feld / Varatonal Appromatons resented by Jose Nuñez 0/24/05 Outlne Introducton Mean Feld Appromaton Structured Mean Feld Weghted Mean Feld Varatonal Methods Introducton roblem: We have dstrbuton but
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationChapter 5 Multilevel Models
Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level
More informationThis column is a continuation of our previous column
Comparson of Goodness of Ft Statstcs for Lnear Regresson, Part II The authors contnue ther dscusson of the correlaton coeffcent n developng a calbraton for quanttatve analyss. Jerome Workman Jr. and Howard
More informationSection 8.3 Polar Form of Complex Numbers
80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the
More informationOn the correction of the h-index for career length
1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat
More informationProf. Dr. I. Nasser Phys 630, T Aug-15 One_dimensional_Ising_Model
EXACT OE-DIMESIOAL ISIG MODEL The one-dmensonal Isng model conssts of a chan of spns, each spn nteractng only wth ts two nearest neghbors. The smple Isng problem n one dmenson can be solved drectly n several
More informationComputing MLE Bias Empirically
Computng MLE Bas Emprcally Kar Wa Lm Australan atonal Unversty January 3, 27 Abstract Ths note studes the bas arses from the MLE estmate of the rate parameter and the mean parameter of an exponental dstrbuton.
More informationThe equation of motion of a dynamical system is given by a set of differential equations. That is (1)
Dynamcal Systems Many engneerng and natural systems are dynamcal systems. For example a pendulum s a dynamcal system. State l The state of the dynamcal system specfes t condtons. For a pendulum n the absence
More informationInductance Calculation for Conductors of Arbitrary Shape
CRYO/02/028 Aprl 5, 2002 Inductance Calculaton for Conductors of Arbtrary Shape L. Bottura Dstrbuton: Internal Summary In ths note we descrbe a method for the numercal calculaton of nductances among conductors
More information= z 20 z n. (k 20) + 4 z k = 4
Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5
More informationSolving Nonlinear Differential Equations by a Neural Network Method
Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,
More informationRobert Eisberg Second edition CH 09 Multielectron atoms ground states and x-ray excitations
Quantum Physcs 量 理 Robert Esberg Second edton CH 09 Multelectron atoms ground states and x-ray exctatons 9-01 By gong through the procedure ndcated n the text, develop the tme-ndependent Schroednger equaton
More informationA new construction of 3-separable matrices via an improved decoding of Macula s construction
Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula
More informationU.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017
U.C. Berkeley CS94: Beyond Worst-Case Analyss Handout 4s Luca Trevsan September 5, 07 Summary of Lecture 4 In whch we ntroduce semdefnte programmng and apply t to Max Cut. Semdefnte Programmng Recall that
More informationMarkov Chain Monte Carlo Lecture 6
where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways
More informationUncertainty and auto-correlation in. Measurement
Uncertanty and auto-correlaton n arxv:1707.03276v2 [physcs.data-an] 30 Dec 2017 Measurement Markus Schebl Federal Offce of Metrology and Surveyng (BEV), 1160 Venna, Austra E-mal: markus.schebl@bev.gv.at
More informationAn Interactive Optimisation Tool for Allocation Problems
An Interactve Optmsaton ool for Allocaton Problems Fredr Bonäs, Joam Westerlund and apo Westerlund Process Desgn Laboratory, Faculty of echnology, Åbo Aadem Unversty, uru 20500, Fnland hs paper presents
More informationWeek3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity
Week3, Chapter 4 Moton n Two Dmensons Lecture Quz A partcle confned to moton along the x axs moves wth constant acceleraton from x =.0 m to x = 8.0 m durng a 1-s tme nterval. The velocty of the partcle
More informationAffine transformations and convexity
Affne transformatons and convexty The purpose of ths document s to prove some basc propertes of affne transformatons nvolvng convex sets. Here are a few onlne references for background nformaton: http://math.ucr.edu/
More information1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations
Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys
More informationNP-Completeness : Proofs
NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationMAXIMUM A POSTERIORI TRANSDUCTION
MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,
More informationOn an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1
On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool
More informationDifference Equations
Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1
More informationHomework Assignment 3 Due in class, Thursday October 15
Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.
More informationTHE SUMMATION NOTATION Ʃ
Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the
More informationCS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements
CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before
More informationMATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)
1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons
More information