A tutorial on variational Bayesian inference. Charles W. Fox & Stephen J. Roberts


A tutorial on variational Bayesian inference
Charles W. Fox & Stephen J. Roberts
Artificial Intelligence Review: An International Science and Engineering Journal (Artif Intell Rev)


A tutorial on variational Bayesian inference

Charles W. Fox · Stephen J. Roberts

© Springer Science+Business Media B.V.

C. W. Fox
Adaptive Behaviour Research Group, University of Sheffield, Sheffield, UK
e-mail: charles.fox@sheffield.ac.uk

S. J. Roberts
Pattern Analysis and Machine Learning Research Group, Department of Engineering Science, University of Oxford, Oxford, UK

Abstract  This tutorial describes the mean-field variational Bayesian approximation to inference in graphical models, using modern machine learning terminology rather than statistical physics concepts. It begins by seeking to find an approximate mean-field distribution close to the target joint in the KL-divergence sense. It then derives local node updates and reviews the recent Variational Message Passing framework.

Keywords  Variational Bayes · Mean-field · Tutorial

1 Introduction

Variational methods have recently become popular in the context of inference problems (Attias 2000; Winn and Bishop 2005). Variational Bayes (VB) is a particular variational method which aims to find some approximate joint distribution $Q(x; \theta)$ over hidden variables $x$ to approximate the true joint $P(x)$, and defines closeness as the KL divergence $KL[Q(x; \theta) \| P(x)]$. The mean-field form of VB assumes that $Q$ factorises into single-variable factors, $Q(x) = \prod_i Q_i(x_i | \theta_i)$. The asymmetric $KL[Q \| P]$ is chosen in VB principally to yield useful computational simplifications, but it can also be viewed as preferring approximations that are accurate in regions of high $Q$, rather than in regions of high $P$. This is often useful because if we were to draw samples from, or integrate over, $Q$, then the regions used would be largely accurate (though they may well miss regions of high $P$).
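This preference can be made concrete with a small numerical sketch (an assumed illustration, not an example from the paper): a narrow $Q$ sitting on one mode of a bimodal $P$ gives a small $KL[Q \| P]$ but a large $KL[P \| Q]$.

```python
import numpy as np

# Illustrative sketch (assumed example): asymmetry of KL[Q||P] versus KL[P||Q].
# P is a bimodal mixture; Q is a single Gaussian sitting on one of P's modes.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

P = 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)
Q = normal_pdf(x, 3.0, 1.0)

def kl(a, b):
    # discretised KL[a||b] = integral of a ln(a/b) dx
    eps = 1e-300                      # guard against log(0) in empty regions
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx)

print("KL[Q||P] =", kl(Q, P))   # small: Q is accurate wherever Q itself has mass
print("KL[P||Q] =", kl(P, Q))   # large: Q is penalised for missing P's other mode
```

Minimising the first quantity, as VB does, is content to ignore the missed mode; minimising the second would force $Q$ to spread over both.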

1.1 Problem statement

We wish to find a set of distributions $\{Q_i(x_i; \theta_i)\}$ to minimise the KL-divergence:

$$KL[Q(x) \| P(x|D)] = \int dx \, Q(x) \ln \frac{Q(x)}{P(x|D)}$$

where $D$ is the observed data, $x$ are the unobserved variables, and

$$Q(x) = \prod_i Q_i(x_i | \theta_i).$$

We sometimes omit the notational dependence of $Q_i$ on $\theta_i$ for clarity. As the $Q_i$ are approximate beliefs, they are subject to the normalisation constraints:

$$\forall i. \quad \int dx_i \, Q_i(x_i) = 1$$

1.2 VB approximates joints, not marginals

$Q$ approximates the joint, but the individual $Q_i(x_i)$ can be poor approximations to the true marginals $P_i(x_i)$. The $Q_i(x_i)$ components should not be expected to resemble even remotely the true marginals, for example when inspecting the computational states of network nodes. This makes VB considerably harder to debug than algorithms whose node states do have some local interpretation: the optimal VB components can be highly counterintuitive and make sense only in the global context.

Figure 1a shows a graphical model of an extreme example of mean-field VB breaking down, as a warning for the algorithms that follow. Suppose a factory produces screws of unknown diameter $\mu$. We know the machines are very accurate, so the precision $\gamma = 1/\sigma^2$ of the diameters is high. However, no-one has told us what the mean diameter is, except for a drunken engineer in a pub who said it was 5 mm. We might have a prior belief $\pi$ that $\mu$ is, say, 5 ± 4 mm (the 4 mm deviation reflecting our low confidence in the report). Suppose we are given a sealed box containing one screw. What is our belief in its diameter $x$? The exact marginal belief in $x$ should be almost identical to our belief in $\mu$, i.e. 5 ± 4 mm. Figure 1b shows one standard deviation of the true Gaussian joint $P(\mu, x)$ and the best mean-field Gaussian approximate joint $Q(\mu, x)$. As discussed above, this $Q$ is chosen to minimise error in its own domain, so it appears as a tight Gaussian. Importantly, the marginal $Q_x(x)$ is now very narrow, and not at all equal to the marginal $P(x)$. We emphasise such cases because we found them to be the major cause of debugging time during development. (A small numerical illustration of this effect is given at the end of this subsection.)

We also want to find the model likelihood $P(D|M)$ for model comparison. This quantity will appear naturally as we try to minimise the KL divergence. There are many ways to think about the following derivations, which we will see involve a balance of three terms called the lower bound, the energy and the entropy. The derivation presented here begins by aiming to minimise the above KL divergence; other equivalent derivations may begin by aiming to maximise the lower bound.
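The screw example can be reproduced numerically. The sketch below is an assumed illustration: it uses the standard mean-field result (not derived in this tutorial) that for a Gaussian target with precision matrix $\Lambda$ the best factorised Gaussian has factor precisions equal to the diagonal entries $\Lambda_{ii}$, and an assumed machine tolerance of 0.01 mm.

```python
import numpy as np

# Minimal sketch of the screw example in Sect. 1.2 (machine noise level is an assumption).
# Prior: mu ~ N(5, 4^2).  Observation model: x ~ N(mu, 1/gamma) with gamma large.
prior_mean, prior_var = 5.0, 4.0 ** 2
gamma = 1.0 / 0.01 ** 2                      # very accurate machines: sd 0.01 mm (assumed)

# Joint P(mu, x) is Gaussian with precision matrix Lambda (variable order: [mu, x]).
Lam = np.array([[1.0 / prior_var + gamma, -gamma],
                [-gamma,                   gamma]])
Sigma = np.linalg.inv(Lam)                   # true joint covariance

# True marginal sd of x: essentially the prior sd (~4 mm), since the screw just copies mu.
true_sd_x = np.sqrt(Sigma[1, 1])

# Best mean-field Gaussian Q(mu)Q(x): each factor's precision is the corresponding
# diagonal entry of Lambda (standard mean-field result for Gaussian targets).
vb_sd_x = np.sqrt(1.0 / Lam[1, 1])

print(f"true marginal sd of x : {true_sd_x:.3f} mm")    # ~4.0 mm
print(f"mean-field Q_x sd     : {vb_sd_x:.5f} mm")       # ~0.01 mm: far too confident
```

The true marginal of $x$ keeps roughly the prior's 4 mm spread, while the mean-field factor collapses to the machine tolerance, exactly the failure mode warned about above.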

Fig. 1  a Graphical model for a population mean problem. Square nodes indicate observed variables. b True joint P and VB approximation Q

1.3 Rewriting KL optimisation as an easier problem

We will rewrite the KL equation in terms that are more tractable. First we flip the numerator and denominator, and flip the sign:

$$KL[Q(x) \| P(x|D)] = \int dx \, Q(x) \ln \frac{Q(x)}{P(x|D)} = -\int dx \, Q(x) \ln \frac{P(x|D)}{Q(x)}$$

Next, replace the conditional $P(x|D)$ with a joint $P(x, D)$ and a model likelihood $P(D)$. The reason for making this rewrite is that for Bayesian networks with exponential family nodes, the $\ln P(x, D)$ term will be a very simple sum of node energy terms, whereas $\ln P(x|D)$ is more complicated. This will simplify later computations.

$$P(x, D) = P(x|D) \, P(D) \quad \Rightarrow \quad \ln P(x|D) = \ln P(x, D) - \ln P(D)$$

$$KL[Q(x) \| P(x|D)] = -\int dx \, Q(x) \left( \ln \frac{P(x, D)}{Q(x)} - \ln P(D) \right)$$

The $\ln P(D)$ term does not involve $Q$, so we can ignore it for the purposes of our minimisation. Finally, define

$$L[Q(x)] = \int dx \, Q(x) \ln \frac{P(x, D)}{Q(x)}$$

so that

$$KL[Q(x) \| P(x|D)] = -L[Q(x)] + \ln P(D)$$

So to minimise the KL divergence, we must maximise $L$. The maximisation is still subject to the normalisation constraints:

$$\forall i. \quad \int dx_i \, Q_i(x_i) = 1$$
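The identity just derived is easy to check exactly on a toy model. In the sketch below (the probability table and the factorised $Q$ are arbitrary assumptions), two binary hidden variables allow $L[Q]$, $\ln P(D)$ and the KL divergence to be enumerated directly.

```python
import numpy as np

# Sketch checking the identity of Sect. 1.3 on a toy discrete model (tables are made up):
# KL[Q(x) || P(x|D)] = ln P(D) - L[Q],  with  L[Q] = sum_x Q(x) ln( P(x,D) / Q(x) ).
rng = np.random.default_rng(0)

# Joint P(x1, x2, D=observed) over two binary hidden variables, for one fixed observation.
P_xD = rng.random((2, 2)) / 10.0       # arbitrary positive entries; they sum to P(D) < 1

# A factorised (mean-field) Q(x) = Q1(x1) Q2(x2), chosen arbitrarily.
Q1 = np.array([0.3, 0.7])
Q2 = np.array([0.6, 0.4])
Q = np.outer(Q1, Q2)

lnPD = np.log(P_xD.sum())                       # model evidence ln P(D)
post = P_xD / P_xD.sum()                        # exact posterior P(x | D)
L = np.sum(Q * (np.log(P_xD) - np.log(Q)))      # lower bound L[Q]
KL = np.sum(Q * (np.log(Q) - np.log(post)))     # KL[Q || P(x|D)]

print(L + KL, lnPD)          # the two numbers agree
print(L <= lnPD)             # True: L is a lower bound on ln P(D)
```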

The values $L[Q(x)]$ are lower bounds on the model log-likelihood $\ln P(D)$, where $P(D) = P(D|M)$ (we generally drop the $M$ notation as we are working with a single model only). The best bound is thus achieved when $L[Q(x)]$ is maximised over $Q$. The reason for $L$ being a lower bound is seen by rearranging as:

$$\ln P(D) = L[Q(x)] + KL[Q(x) \| P(x|D)]$$

Thus when the KL-divergence is zero (a perfect fit), $L$ is equal to the model log-likelihood. When the fit is not perfect, the KL-divergence is always positive and so $L[Q(x)] < \ln P(D)$. Another rearrangement gives

$$L[Q(x)] = \ln P(D) - KL[Q(x) \| P(x|D)]$$

showing that the KL divergence is the error between $L$ and $\ln P(D)$.

1.4 Solution of free energy optimisation

We now wish to find $Q$ to maximise the lower bound, subject to normalisation constraints:

$$L[Q(x)] = \int dx \, Q(x) \ln \frac{P(x, D)}{Q(x)} = \int dx \, Q(x) \ln P(x, D) - \int dx \, Q(x) \ln Q(x) = \langle E(x, D) \rangle_{Q(x)} + H[Q(x)]$$

where we define the energy as $E = \ln P$ and the entropy (Shannon entropy, used by convention) as $H[Q(x)] = -\int dx \, Q(x) \ln Q(x)$. For exponential models, this will become a convenient sum of linear functions. By the mean-field assumption:

$$L[Q(x)] = \int dx \left( \prod_i Q_i(x_i) \right) E(x, D) - \sum_i \int dx \left( \prod_k Q_k(x_k) \right) \ln Q_i(x_i) \qquad (1)$$

Consider the entropy (rightmost) term. We can bring out the sum, and use the partition $x = \{x_i, \bar{x}_i\}$, where $\bar{x}_i = x \setminus x_i$:

$$\sum_i \int dx \left( \prod_k Q_k(x_k) \right) \ln Q_i(x_i) = \sum_i \int dx_i \, d\bar{x}_i \, Q(\bar{x}_i) Q_i(x_i) \ln Q_i(x_i) = \sum_i \int dx_i \, Q_i(x_i) \ln Q_i(x_i) \int d\bar{x}_i \, Q(\bar{x}_i) = \sum_i \int dx_i \, Q_i(x_i) \ln Q_i(x_i)$$

Substituting this into the right term of Eq. 1:

$$L[Q(x)] = \int dx \left( \prod_i Q_i(x_i) \right) E(x, D) - \sum_i \int dx_i \, Q_i(x_i) \ln Q_i(x_i) \qquad (2)$$
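The entropy simplification used between Eqs. 1 and 2 is just the additivity of entropy over independent factors, which a tiny sketch can confirm (the factor tables below are arbitrary assumptions):

```python
import numpy as np

# Sketch of the step from Eq. (1) to Eq. (2): for a factorised Q the joint entropy
# equals the sum of per-factor entropies.
Q1 = np.array([0.3, 0.7])
Q2 = np.array([0.2, 0.5, 0.3])
Q = np.outer(Q1, Q2)                                          # Q(x1, x2) = Q1(x1) Q2(x2)

H_joint = -np.sum(Q * np.log(Q))                              # H[Q(x)]
H_sum = -np.sum(Q1 * np.log(Q1)) - np.sum(Q2 * np.log(Q2))    # H[Q1] + H[Q2]
print(H_joint, H_sum)                                         # identical up to rounding
```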

Now look at and rearrange the energy (left) term of Eq. 2, again separating out one variable:

$$\int dx \left( \prod_i Q_i(x_i) \right) E(x, D) = \int dx_i \, Q_i(x_i) \int d\bar{x}_i \, Q(\bar{x}_i) E(x, D) = \int dx_i \, Q_i(x_i) \langle E(x, D) \rangle_{Q(\bar{x}_i)}$$

$$= \int dx_i \, Q_i(x_i) \ln \exp \langle E(x, D) \rangle_{Q(\bar{x}_i)} = \int dx_i \, Q_i(x_i) \ln Q_i^*(x_i) + \ln Z_i$$

where we have defined $Q_i^*(x_i) = \frac{1}{Z_i} \exp \langle E(x, D) \rangle_{Q(\bar{x}_i)}$ and $Z_i$ normalises $Q_i^*(x_i)$. Substituting this new form of the energy back into Eq. 2 yields

$$L[Q(x)] = \int dx_i \, Q_i(x_i) \ln Q_i^*(x_i) - \sum_j \int dx_j \, Q_j(x_j) \ln Q_j(x_j) + \ln Z_i$$

Separate out the entropy $H_i = H[Q_i(x_i)]$ from the rest of the entropy sum:

$$L[Q(x)] = \left\{ \int dx_i \, Q_i(x_i) \ln Q_i^*(x_i) - \int dx_i \, Q_i(x_i) \ln Q_i(x_i) \right\} + H[Q(\bar{x}_i)] + \ln Z_i$$

Consider the terms in the brackets:

$$\int dx_i \, Q_i(x_i) \ln Q_i^*(x_i) - \int dx_i \, Q_i(x_i) \ln Q_i(x_i) = \int dx_i \, Q_i(x_i) \ln \frac{Q_i^*(x_i)}{Q_i(x_i)} = -KL[Q_i(x_i) \| Q_i^*(x_i)]$$

What a lucky coincidence! Though we started by trying to minimise the KL-divergence between large joint distributions (which is hard), we have converted the problem to that of minimising KL-divergences between individual one-dimensional distributions (which is easier). Write:

$$L[Q(x)] = -KL[Q_i(x_i) \| Q_i^*(x_i)] + H[Q(\bar{x}_i)] + \ln Z_i$$

Thus $L$ depends on each individual $Q_i$ only through the KL term. We wish to maximise $L$ with respect to each $Q_i$, subject to the constraint that all $Q_i$ are normalised to unity. This could be achieved by Lagrange multipliers and functional differentiation:

$$\frac{\delta}{\delta Q_i(x_i)} \left[ KL[Q_i(x_i) \| Q_i^*(x_i)] - \lambda \left( \int Q_i(x_i) \, dx_i - 1 \right) \right] := 0$$

A long algebraic derivation would then eventually lead to a Gibbs distribution. However, thanks to the KL form rearrangement we do not need to perform any of this, because we can see immediately that $L$ will be maximised when the KL divergence is zero, hence when $Q_i(x_i) = Q_i^*(x_i)$ (the normalisation constraint on $Q_i$ is satisfied thanks to the inclusion of $Z_i$ in the previous definition of $Q_i^*$). Expanding back the definition gives the optimal $Q_i$ to be

$$Q_i(x_i) = \frac{1}{Z_i} \exp \langle E(x_i, \bar{x}_i, D) \rangle_{Q(\bar{x}_i)}$$

where $E(x_i, \bar{x}_i, D) = \ln P(x_i, \bar{x}_i, D)$ is the energy.
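Setting each factor to this optimal form in turn, and repeating, gives a simple coordinate-ascent algorithm (the scheme formalised in Sect. 1.5 below). A minimal sketch on a toy discrete model (the probability table is an arbitrary assumption) is:

```python
import numpy as np

# Sketch of the update Q_i(x_i) = (1/Z_i) exp< E(x, D) >_{Q(other variables)},
# applied cyclically to a toy discrete model with two binary hidden variables.
rng = np.random.default_rng(1)
P_xD = rng.random((2, 2)) + 0.1            # P(x1, x2, D) for one fixed observation D
lnP = np.log(P_xD)                         # the energy E(x, D) = ln P(x, D)

Q1 = np.array([0.5, 0.5])
Q2 = np.array([0.5, 0.5])

def bound(Q1, Q2):
    Q = np.outer(Q1, Q2)
    return np.sum(Q * (lnP - np.log(Q)))   # L[Q] = <E> + H[Q]

for sweep in range(10):
    # update Q1: average the energy over Q2, exponentiate, normalise
    logq1 = lnP @ Q2
    Q1 = np.exp(logq1 - logq1.max()); Q1 /= Q1.sum()
    # update Q2 symmetrically
    logq2 = Q1 @ lnP
    Q2 = np.exp(logq2 - logq2.max()); Q2 /= Q2.sum()
    print(f"sweep {sweep}: L = {bound(Q1, Q2):.6f}")   # non-decreasing

print("Q1 =", Q1, " Q2 =", Q2)
```

Each update can only shrink the corresponding KL term, so the bound printed at each sweep never decreases.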

Fig. 2  Graphical model for variational example. Square nodes are observed

1.5 Solution as iterative update equations

Converting the equation above into an iterative update equation gives:

$$Q_i(x_i) \leftarrow \frac{1}{Z_i} \exp \langle E(x_i, \bar{x}_i, D) \rangle_{Q(\bar{x}_i)}$$

where $x_i$ is a hidden node to be updated, $D$ are the observed data, and $\bar{x}_i = x \setminus x_i$ are the other hidden nodes. These updates give us an EM-like algorithm, optimising one node at a time in the hope of converging to a local optimum.

Graphical models define conditional independence relationships between nodes, so for models having such structure there exists a Markov blanket set of nodes $mb(x_i)$ for each node $x_i$ such that $P(x_i | \bar{x}_i) = P(x_i | mb(x_i))$. For undirected graphical model links, the Markov blanket is the set of neighbouring nodes of $x_i$; for directed Bayesian networks it is the neighbours of $x_i$ augmented with the set of co-parents of $x_i$. By the definition of Markov blankets, we may write

$$\ln Q_i(x_i) \propto \langle \ln P(x_i, mb(x_i), D) \rangle_{Q(mb(x_i))}$$

where $mb(x_i)$ is the Markov blanket of the node $x_i$ of interest.

1.6 Example: variational Bayesian mean update

Until recently, variational implementations have consisted of calculating the iterative update equations by hand for each specific network model. These calculations required much algebra and were error-prone. We consider the simple graphical model shown in Fig. 2: the task is to infer the mean $\mu$ component of the posterior joint of a Gaussian population of unknown precision $\gamma$, given a set of $M$ observations $D = \{D_i\}_{i=1}^M$ drawn from this population. Making the mean-field approximation and assuming conjugate exponential priors we have:

$$Q(x) = Q(\mu) Q(\gamma) = N(\mu; m, \beta^{-1}) \, \Gamma(\gamma; a, b)$$

where $m, \beta, a$ and $b$ are constants specifying conjugate priors on the population parameters. The update equation is

$$\ln Q_i(x_i) \propto \langle \ln P(x_i, mb(x_i), D) \rangle_{Q(mb(x_i))}$$

Substituting the variables for the mean update, $x_i = \mu$:

$$\ln Q(\mu) \propto \int d\gamma \, \Gamma(\gamma; a, b) \ln P(\mu, \gamma, \{D_i\})$$

$$= \int d\gamma \, \Gamma(\gamma; a, b) \ln \Big[ N(\mu; m, \beta^{-1}) \, \Gamma(\gamma; a, b) \prod_i N(D_i | \mu, \gamma) \Big]$$

$$= \int d\gamma \, \Gamma(\gamma; a, b) \Big[ \ln N(\mu; m, \beta^{-1}) + \ln \Gamma(\gamma; a, b) + \sum_i \ln N(D_i | \mu, \gamma) \Big]$$

$$= \int d\gamma \, \Gamma(\gamma; a, b) \ln N(\mu; m, \beta^{-1}) + \int d\gamma \, \Gamma(\gamma; a, b) \ln \Gamma(\gamma; a, b) + \sum_i \int d\gamma \, \Gamma(\gamma; a, b) \ln N(D_i | \mu, \gamma)$$

This simplifies greatly. First, integrands not dependent on $\mu$ can be discarded and replaced by a normalising $Z$, since we are only interested in the PDF for $Q(\mu)$:

$$\ln Q(\mu) = \int d\gamma \, \Gamma(\gamma; a, b) \ln N(\mu; m, \beta^{-1}) + \sum_i \int d\gamma \, \Gamma(\gamma; a, b) \ln N(D_i | \mu, \gamma) + \ln \frac{1}{Z}$$

The left term does depend on $\mu$ but not on $\gamma$, so the integral over the normalised $Q(\gamma)$ has no effect. Its integral is simply its integrand, so we can write

$$\ln Q(\mu) = \ln N(\mu; m, \beta^{-1}) + \sum_i \int d\gamma \, \Gamma(\gamma; a, b) \ln N(D_i | \mu, \gamma) + \ln \frac{1}{Z}$$

We next consider the integral containing data $D_i$. Writing out its log Gaussian energy expression in full, this term becomes

$$\int d\gamma \, \Gamma(\gamma; a, b) \ln N(D_i | \mu, \gamma) = \int d\gamma \, \Gamma(\gamma; a, b) \Big[ c - \tfrac{1}{2} (D_i - \mu) \gamma (D_i - \mu) \Big]$$

where $c$ is its normalising constant. Rewriting up to normalisation gives

$$-\tfrac{1}{2} (D_i - \mu) \Big[ \int d\gamma \, \Gamma(\gamma; a, b) \, \gamma \Big] (D_i - \mu)$$

The required integral is now just the expectation (first moment) of a Gamma distribution, which can be found in standard statistics tables:

$$\langle \gamma \rangle = E[\Gamma(\gamma; a, b)] = a b^{-1}$$

(We notate pre- and post-multiplications by $(D_i - \mu)$ rather than a single multiplication by $(D_i - \mu)^2$ so that the multivariate Wishart case follows the same derivation and notation.) The term is now

$$-\tfrac{1}{2} (D_i - \mu) \langle \gamma \rangle (D_i - \mu)$$

Substituting this back into the equation for $\ln Q(\mu)$ gives

$$\ln Q(\mu) = \ln N(\mu; m, \beta^{-1}) - \tfrac{1}{2} \sum_i (D_i - \mu) \langle \gamma \rangle (D_i - \mu) + \ln \frac{1}{Z}$$

$$Q(\mu) = \frac{1}{Z} N(\mu; m, \beta^{-1}) \exp \Big\{ -\tfrac{1}{2} \sum_i (D_i - \mu) \langle \gamma \rangle (D_i - \mu) \Big\}$$

$$Q(\mu) = \frac{1}{Z} N(\mu; m, \beta^{-1}) \prod_i N(D_i | \mu, \langle \gamma \rangle)$$

Switching round the dependent variable we obtain the following. (For $\mu$ in a Gaussian distribution, the flipped result is still Gaussian; for other distributions, it will be the conjugate form.)

$$Q(\mu) = \frac{1}{Z} N(\mu; m, \beta^{-1}) \prod_i N(\mu | D_i, \langle \gamma \rangle)$$

Finally we can use the standard equation for a product of Gaussians to give

$$Q(\mu) = N(\mu | m', \beta'^{-1})$$

with

$$\beta' = \beta + M \langle \gamma \rangle, \qquad m' = \beta'^{-1} \Big( \beta m + \langle \gamma \rangle \sum_{i=1}^M D_i \Big)$$

The above illustrates how the general VB updates can be transformed into particular updates for graphical models. The $Q(\mu)$ component is meaningless by itself, as VB aims to approximate the full joint rather than local marginals, so a more useful analysis would repeat the above to obtain the $Q(\gamma)$ updates as well, requiring even more algebra. We see that this process is time-consuming and error-prone even for very simple networks such as the Gaussian population used here. (A small numerical sketch of these closed-form updates is given below, after Sect. 1.7.)

1.7 Variational Bayes with message passing

The previous method of by-hand derivation was time-consuming and tedious, though until recently it was state-of-the-art. However, the recent variational message passing (VMP) algorithm (Bishop et al. 2002; Winn and Bishop 2005) has shown how to automate these derivations in the case of conjugate-exponential networks. For such networks, the updates all have a standard form, involving the sufficient statistics and natural parameters of the particular node types. For unavoidably non-conjugate-exponential nodes (such as mixture models in particular) it is possible to make further approximations to bring them into the standard form. For conjugate-exponential networks, VMP should now be the standard method for variational Bayesian inference, replacing derivations by hand. For non-conjugate-exponential networks, VMP may still be useful if fast approximations are required at the expense of accuracy.
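As promised above, the closed-form mean update of Sect. 1.6 can be exercised in a few lines. The sketch below is illustrative only: the prior hyperparameters, the current Gamma factor, and the data are all assumed values.

```python
import numpy as np

# Sketch of the closed-form Q(mu) update derived in Sect. 1.6.
# Prior hyperparameters, the current Q(gamma), and the data are illustrative assumptions.
rng = np.random.default_rng(2)

m, beta = 5.0, 1.0 / 4.0 ** 2        # prior N(mu; m, beta^{-1}), i.e. 5 +/- 4 mm
a, b = 3.0, 0.03                     # current Gamma factor Q(gamma; a, b)
D = rng.normal(4.2, 0.1, size=20)    # M = 20 observations
M = len(D)

gamma_mean = a / b                                    # <gamma> = a b^{-1}
beta_new = beta + M * gamma_mean                      # beta' = beta + M <gamma>
m_new = (beta * m + gamma_mean * D.sum()) / beta_new  # m' = beta'^{-1} (beta m + <gamma> sum_i D_i)

print(f"Q(mu) = N(mu; {m_new:.4f}, sd {beta_new ** -0.5:.4f})")
```

A full implementation would alternate this with the corresponding $Q(\gamma)$ update; VMP, introduced in Sect. 1.7, generates both automatically.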

1.8 VMP details

A standard theorem (Bernardo and Smith 2000) about exponential family distributions shows that the expectations of the sufficient statistics are given simply by:

$$\langle u(x_i) \rangle = -\nabla_{\phi} \, g(\phi)$$

where the del symbol ($\nabla$) means we form a vector whose $r$th component is the derivative of $g$ with respect to the $r$th component of the vector $\phi$.

We will write $x_i$'s set of parents as $pa(x_i)$; its set of children as $ch(x_i)$; its set of co-parents with respect to a particular child $ch$ as $cop(x_i; ch)$; and its set of co-parents over all children as $cop(x_i)$. We wish to compute the variational Bayesian update as in the previous section:

$$\ln Q_i(x_i) \propto \langle \ln P(x_i, mb(x_i), D) \rangle_{Q(mb(x_i))}$$

Assuming the effects of $D$ are already in the Markov blanket nodes, and separating, this becomes

$$\langle \ln P(pa(x_i)) + \ln P(cop(x_i)) + \ln P(x_i | pa(x_i)) + \ln P(ch(x_i) | x_i, cop(x_i)) \rangle_{Q(mb(x_i))}$$

Simplifying and dropping constant parent and co-parent terms:

$$= \langle \ln P(x_i | pa(x_i)) \rangle_{Q(pa(x_i))} + \langle \ln P(ch(x_i) | x_i, cop(x_i)) \rangle_{Q(ch(x_i), cop(x_i))}$$

The children separate:

$$= \langle \ln P(x_i | pa(x_i)) \rangle_{Q(pa(x_i))} + \sum_{ch \in ch(x_i)} \langle \ln P(ch | x_i, cop(x_i; ch)) \rangle_{Q(ch, cop(x_i; ch))}$$

We will consider the two parts one at a time.

Messages from parents

A conjugate-exponential node $x_i$ is parametrised by a natural parameter vector $\phi_i$. By the definition of such nodes,

$$\langle \ln P(x_i | pa(x_i)) \rangle_{Q(pa(x_i))} = \langle \phi_i^{\top} u_i(x_i) + f_i(x_i) + g_i(\phi_i) \rangle_{Q(pa(x_i))} = \langle \phi_i \rangle_{Q(pa(x_i))}^{\top} u_i(x_i) + f_i(x_i) + \langle g_i(\phi_i) \rangle_{Q(pa(x_i))}$$

As $\phi_i$ and $g_i$ are multi-linear functions of the parent sufficient statistics (by construction), and using the mean-field assumption, we can simply take their formulae (defined as conditional on single values of the parents) and substitute the expectations of the sufficient statistics, to get the expectation of the whole expression as required. So parents of $x_i$ need only send their sufficient statistic expectations to $x_i$ as messages. (A concrete example is shown in Sect. 1.9.)

Messages to parents

A key property of the exponential family is that we can multiply (fuse) similar distributions by adding their natural parameter vectors $\phi$ (up to normalisation):

$$\exp\{\phi_1^{\top} u(x_i) + f(x_i) + g(\phi_1)\} \cdot \exp\{\phi_2^{\top} u(x_i) + f(x_i) + g(\phi_2)\} \propto \exp\{(\phi_1 + \phi_2)^{\top} u(x_i) + f(x_i) + g(\phi_1 + \phi_2)\}$$
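This fusion property can be checked quickly for Gaussian factors over the same variable, using the natural parameterisation $\phi = [m\beta, -\beta/2]$ introduced in Sect. 1.9 (the two example densities below are arbitrary assumptions):

```python
import numpy as np

# Sketch of the fusion property: multiplying two exponential-family distributions over the
# same variable adds their natural parameter vectors. Shown here for univariate Gaussians.
def nat(m, beta):
    return np.array([m * beta, -beta / 2.0])    # phi = [m*beta, -beta/2]

def from_nat(phi):
    beta = -2.0 * phi[1]
    return phi[0] / beta, beta                  # recover (mean, precision)

m1, b1 = 0.0, 1.0
m2, b2 = 3.0, 4.0
m_fused, b_fused = from_nat(nat(m1, b1) + nat(m2, b2))
print("fused:", m_fused, b_fused)               # mean 2.4, precision 5.0

# Cross-check on a grid: the renormalised pointwise product of the two densities
# matches the Gaussian with the fused parameters.
x = np.linspace(-6.0, 8.0, 2001); dx = x[1] - x[0]
pdf = lambda x, m, b: np.sqrt(b / (2 * np.pi)) * np.exp(-0.5 * b * (x - m) ** 2)
prod = pdf(x, m1, b1) * pdf(x, m2, b2)
prod /= prod.sum() * dx
print("max abs difference:", np.max(np.abs(prod - pdf(x, m_fused, b_fused))))
```

This is the property that lets a parent absorb child messages simply by adding them to its own parameters, as described next.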

A second property is that, by conjugacy, $\phi$ and $g$ are also multi-linear in the parental sufficient statistics. So we can always rearrange the formula, by finding functions $\phi_j, f_j, g_j$, to make it look like a function of a parent $x_j \in pa(x_i)$:

$$\langle \ln P(x_i | pa(x_i)) \rangle_{Q(pa(x_i))} = \langle \phi_j^{\top} u_j(x_j) + f_j(x_j) + g_j(\phi_j) \rangle_{Q(pa(x_i))}$$

As before, we may handle the expectation by using the multi-linear property to push all the expectations tight around the sufficient statistics. So from the point of view of the parent, this is written in terms of the sufficient statistic expectations of its child and co-parents. We can thus pass a likelihood message consisting of

$$\phi_j \Big( \langle u(x_i) \rangle, \, \{ \langle u(cop) \rangle \}_{cop \in cop(x_j; x_i)} \Big)$$

The parent may then simply add these to its prior parameters, by the first property. Observed data nodes $D$ may be treated as delta distributions which send standard messages to their parents and children.

1.9 Example: mean, precision and data using VMP

To demonstrate the power of the VMP formalism, we consider the same scenario as used to hand-derive the VB updates in Sect. 1.6. Using VMP we can substitute the particular Gaussian and Gamma distributions into the VMP update equations and quickly obtain all network messages for mean, precision, and data nodes as follows. (Messages to $D$ are not required in this particular example, but are useful in general for making inferences about unobserved data.) The exponential forms used here are standard (Bernardo and Smith 2000).

Mean node (with known prior parameters)

Beginning with the conjugate-exponential form of the Gaussian distribution:

$$\ln P(\mu | m, \beta) = \begin{bmatrix} m\beta \\ -\beta/2 \end{bmatrix}^{\top} \begin{bmatrix} \mu \\ \mu^2 \end{bmatrix} + g(m, \beta) \qquad \text{with} \qquad g(m, \beta) = \tfrac{1}{2} (\ln \beta - \beta m^2 - \ln 2\pi)$$

The message to child $D$ is the expectation of the sufficient statistics:

$$\begin{bmatrix} \langle \mu \rangle \\ \langle \mu^2 \rangle \end{bmatrix} = -\nabla_{\phi} \, g(\phi) = \begin{bmatrix} m \\ m^2 + \beta^{-1} \end{bmatrix}$$

Computing the log-likelihood bound using VMP

VMP also makes computation of the log-likelihood bound simple. Recall that

$$\ln P(D|M) \geq L[Q(x)], \qquad L[Q(x)] = \int dx \, Q(x) \log \frac{P(x, D)}{Q(x)} = \langle \log P(x, D) \rangle_{Q(x)} - \langle \log Q(x) \rangle_{Q(x)}$$
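Before assembling this bound from per-node contributions (next), the mean-node child message stated above can be checked with a quick finite-difference sketch (the values of $m$ and $\beta$ below are assumptions):

```python
import numpy as np

# Sketch of the mean-node child message of Sect. 1.9: with phi = [m*beta, -beta/2],
# -grad_phi g(phi) should reproduce [<mu>, <mu^2>] = [m, m^2 + 1/beta].
m, beta = 1.3, 0.7
phi = np.array([m * beta, -beta / 2.0])

def g(phi):
    # g(m, beta) = 0.5 (ln beta - beta m^2 - ln 2 pi), rewritten as a function of phi
    beta_ = -2.0 * phi[1]
    m_ = phi[0] / beta_
    return 0.5 * (np.log(beta_) - beta_ * m_ ** 2 - np.log(2 * np.pi))

# numerical gradient of g by central differences
eps = 1e-6
grad = np.array([(g(phi + eps * np.eye(2)[r]) - g(phi - eps * np.eye(2)[r])) / (2 * eps)
                 for r in range(2)])

print("message -grad g(phi):", -grad)                     # ~ [1.3, 1.3^2 + 1/0.7]
print("[m, m^2 + 1/beta]   :", [m, m ** 2 + 1.0 / beta])
```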

Writing $L[Q(x)]$ as the sum of individual node contributions from the universe of all (data and hidden) nodes $\Omega = x \cup D$,

$$L[Q(x)] = \sum_{\omega \in \Omega} \Big( \langle \log P(\omega | pa(\omega)) \rangle_{Q(\omega, pa(\omega))} - \langle \log Q(\omega) \rangle_{Q(\omega)} \Big) = \sum_{\omega \in \Omega} L_{\omega}$$

with

$$L_{\omega} = \langle \log P(\omega | pa(\omega)) \rangle_{Q(\omega, pa(\omega))} - \langle \log Q(\omega) \rangle_{Q(\omega)} = \langle \phi^{\pi\,\top}_{\omega} u(\omega) + f(\omega) + g(\phi^{\pi}_{\omega}) \rangle_{Q(\omega, pa(\omega))} - \langle \phi^{\top}_{\omega} u(\omega) + f(\omega) + g(\phi_{\omega}) \rangle_{Q(\omega)}$$

where $\phi^{\pi}$ are the prior natural parameters conditioned only on the parents, and $\phi$ are the posterior parameters after child messages are fused. $Q(\omega)$ uses the posterior parameters. By multi-linearity we can push in the expectations and simplify to obtain:

$$L_{\omega} = \Big( \langle \phi^{\pi}_{\omega} \rangle_{Q(pa(\omega))} - \phi_{\omega} \Big)^{\top} \langle u(\omega) \rangle_{Q(\omega)} + \langle g(\phi^{\pi}_{\omega}) \rangle_{Q(pa(\omega))} - g(\phi_{\omega})$$

which is simple to compute locally from each node's received messages. (The term $\langle \phi^{\pi}_{\omega} \rangle_{Q(pa(\omega))}$ is computed by substituting the received parent sufficient statistic expectations into the conjugate-exponential formula for $\phi^{\pi}_{\omega}$. By multi-linearity, the expectation can be pushed into the sufficient statistics.) The global model log-likelihood bound is computed by summing the contributions from all nodes, both hidden and observed.

1.10 Summary

In this tutorial we have seen how variational methods may be used to approximate joint posteriors with mean-field distributions. A long-hand calculation of variational inference was shown, then the more general variational message passing framework was introduced, which greatly simplifies calculations for conjugate-exponential networks. Variational methods are not appropriate when the marginals or the correlation structure of the joint are required, but they are useful in model comparison tasks where the joint is integrated out.

References

Attias H (2000) A variational Bayesian framework for graphical models. In: Advances in neural information processing systems. MIT Press
Bernardo JM, Smith AFM (2000) Bayesian theory. Wiley, London
Bishop CM, Winn JM, Spiegelhalter D (2002) VIBES: a variational inference engine for Bayesian networks. In: Advances in neural information processing systems
Winn J, Bishop C (2005) Variational message passing. J Mach Learn Res 6
