Variational Bayesian Theory

Chapter 2

Variational Bayesian Theory

2.1 Introduction

This chapter covers the majority of the theory for variational Bayesian learning that will be used in the rest of this thesis. It is intended to give the reader a context for the use of variational methods as well as an insight into their general applicability and usefulness.

In a model selection task the role of a Bayesian is to calculate the posterior distribution over a set of models given some a priori knowledge and some new observations (data). The knowledge is represented in the form of a prior over model structures p(m), and their parameters p(θ | m), which define the probabilistic dependencies between the variables in the model. By Bayes' rule, the posterior over models m having seen data y is given by:

p(m | y) = p(m) p(y | m) / p(y) .   (2.1)

The second term in the numerator is the marginal likelihood or evidence for a model m, and is the key quantity for Bayesian model selection:

p(y | m) = ∫ dθ p(θ | m) p(y | θ, m) .   (2.2)

For each model structure we can compute the posterior distribution over parameters:

p(θ | y, m) = p(θ | m) p(y | θ, m) / p(y | m) .   (2.3)
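For simple conjugate models the quantities in (2.1)–(2.3) are available in closed form. The following Python sketch (an illustration only, not taken from the thesis) compares two hypothetical Beta-Bernoulli models of coin-flip data; the priors and data are invented for the example, and the marginal likelihood (2.2) integrates out the Bernoulli parameter analytically.

```python
import numpy as np
from scipy.special import betaln

# Toy data: 12 coin flips, 9 heads (invented for illustration).
n, k = 12, 9

# Two candidate "model structures": a symmetric prior and a heads-biased prior,
# each a Beta(a, b) distribution on the Bernoulli parameter theta.
models = {"m1 (symmetric prior)": (5.0, 5.0),
          "m2 (biased prior)":    (8.0, 2.0)}

log_evidence = {}
for name, (a, b) in models.items():
    # Marginal likelihood (2.2) for this sequence: B(a + k, b + n - k) / B(a, b),
    # with theta integrated out analytically.
    log_evidence[name] = betaln(a + k, b + n - k) - betaln(a, b)

# Posterior over models (2.1), assuming equal prior probabilities p(m).
logs = np.array(list(log_evidence.values()))
post = np.exp(logs - logs.max())
post /= post.sum()

for (name, le), pm in zip(log_evidence.items(), post):
    print(f"{name}: ln p(y|m) = {le:.3f}, p(m|y) = {pm:.3f}")
```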

We might also be interested in calculating other related quantities, such as the predictive density of a new datum y' given a data set y = {y_1, ..., y_n}:

p(y' | y, m) = ∫ dθ p(θ | y, m) p(y' | θ, y, m) ,   (2.4)

which can be simplified into

p(y' | y, m) = ∫ dθ p(θ | y, m) p(y' | θ, m)   (2.5)

if y' is conditionally independent of y given θ. We may also be interested in calculating the posterior distribution of a hidden variable, x', associated with the new observation y':

p(x' | y', y, m) ∝ ∫ dθ p(θ | y, m) p(x', y' | θ, m) .   (2.6)

The simplest way to approximate the above integrals is to estimate the value of the integrand at a single point estimate of θ, such as the maximum likelihood (ML) or the maximum a posteriori (MAP) estimate, which aim to maximise respectively the second term and both terms of the integrand in (2.2):

θ_ML = arg max_θ p(y | θ, m) ,   (2.7)
θ_MAP = arg max_θ p(θ | m) p(y | θ, m) .   (2.8)

ML and MAP examine only probability density, rather than mass, and so can neglect potentially large contributions to the integral. A more principled approach is to estimate the integral numerically by evaluating the integrand at many different θ via Monte Carlo methods. In the limit of an infinite number of samples of θ this produces an accurate result, but despite ingenious attempts to curb the curse of dimensionality in θ using methods such as Markov chain Monte Carlo, these methods remain prohibitively computationally intensive in interesting models. These methods were reviewed in the last chapter, and the bulk of this chapter concentrates on a third way of approximating the integral, using variational methods. The key to the variational method is to approximate the integral with a simpler form that is tractable, forming a lower or upper bound. The integration then translates into the implementationally simpler problem of bound optimisation: making the bound as tight as possible to the true value.

We begin in section 2.2 by describing how variational methods can be used to derive the well-known expectation-maximisation (EM) algorithm for learning the maximum likelihood (ML) parameters of a model. In section 2.3 we concentrate on the Bayesian methodology, in which priors are placed on the parameters of the model, and their uncertainty integrated over to give the marginal likelihood (2.2). We then generalise the variational procedure to yield the variational Bayesian EM (VBEM) algorithm, which iteratively optimises a lower bound on this marginal likelihood.

In analogy to the EM algorithm, the iterations consist of a variational Bayesian E (VBE) step in which the hidden variables are inferred using an ensemble of models according to their posterior probability, and a variational Bayesian M (VBM) step in which a posterior distribution over model parameters is inferred. In section 2.4 we specialise this algorithm to a large class of models which we call conjugate-exponential (CE): we present the variational Bayesian EM algorithm for CE models and discuss the implications for both directed graphs (Bayesian networks) and undirected graphs (Markov networks) in section 2.5. In particular we show that we can incorporate existing propagation algorithms into the variational Bayesian framework and that the complexity of inference for the variational Bayesian treatment is approximately the same as for the ML scenario. In section 2.6 we compare VB to the BIC and Cheeseman-Stutz criteria, and finally summarise in section 2.7.

2.2 Variational methods for ML / MAP learning

In this section we review the derivation of the EM algorithm for probabilistic models with hidden variables. The algorithm is derived using a variational approach, and has exact and approximate versions. We investigate themes on convexity, computational tractability, and the Kullback-Leibler divergence to give a deeper understanding of the EM algorithm. The majority of the section concentrates on maximum likelihood (ML) learning of the parameters; at the end we present the simple extension to maximum a posteriori (MAP) learning. The hope is that this section provides a good stepping-stone on to the variational Bayesian EM algorithm that is presented in the subsequent sections and used throughout the rest of this thesis.

2.2.1 The scenario for parameter learning

Consider a model with hidden variables x and observed variables y. The parameters describing the (potentially) stochastic dependencies between variables are given by θ. In particular consider the generative model that produces a dataset y = {y_1, ..., y_n} consisting of n independent and identically distributed (i.i.d.) items, generated using a set of hidden variables x = {x_1, ..., x_n} such that the likelihood can be written as a function of θ in the following way:

p(y | θ) = ∏_{i=1}^n p(y_i | θ) = ∏_{i=1}^n ∫ dx_i p(x_i, y_i | θ) .   (2.9)

The integration over hidden variables x_i is required to form the likelihood of the parameters, as a function of just the observed data y_i. We have assumed that the hidden variables are continuous as opposed to discrete (hence an integral rather than a summation), but we do so without loss of generality. As a point of nomenclature, note that we use x_i and y_i to denote collections of |x_i| hidden and |y_i| observed variables respectively: x_i = {x_{i1}, ..., x_{i|x_i|}}, and y_i = {y_{i1}, ..., y_{i|y_i|}}.

We use the | · | notation to denote the size of a collection of variables. ML learning seeks to find the parameter setting θ_ML that maximises this likelihood, or equivalently the logarithm of this likelihood,

L(θ) ≡ ln p(y | θ) = Σ_{i=1}^n ln p(y_i | θ) = Σ_{i=1}^n ln ∫ dx_i p(x_i, y_i | θ) ,   (2.10)

so defining

θ_ML ≡ arg max_θ L(θ) .   (2.11)

To keep the derivations clear, we write L as a function of θ only; the dependence on y is implicit. In Bayesian networks without hidden variables and with independent parameters, the log-likelihood decomposes into local terms on each y_{ij}, and so finding the setting of each parameter of the model that maximises the likelihood is straightforward. Unfortunately, if some of the variables are hidden this will in general induce dependencies between all the parameters of the model and so make maximising (2.10) difficult. Moreover, for models with many hidden variables, the integral (or sum) over x can be intractable.

We simplify the problem of maximising L(θ) with respect to θ by introducing an auxiliary distribution over the hidden variables. Any probability distribution q_x(x) over the hidden variables gives rise to a lower bound on L. In fact, for each data point y_i we use a distinct distribution q_{x_i}(x_i) over the hidden variables to obtain the lower bound:

L(θ) = Σ_i ln ∫ dx_i p(x_i, y_i | θ)   (2.12)
     = Σ_i ln ∫ dx_i q_{x_i}(x_i) [ p(x_i, y_i | θ) / q_{x_i}(x_i) ]   (2.13)
     ≥ Σ_i ∫ dx_i q_{x_i}(x_i) ln [ p(x_i, y_i | θ) / q_{x_i}(x_i) ]   (2.14)
     = Σ_i ∫ dx_i q_{x_i}(x_i) ln p(x_i, y_i | θ) − Σ_i ∫ dx_i q_{x_i}(x_i) ln q_{x_i}(x_i)   (2.15)
     ≡ F(q_{x_1}(x_1), ..., q_{x_n}(x_n), θ) ,   (2.16)

where we have made use of Jensen's inequality (Jensen, 1906), which follows from the fact that the log function is concave. F(q_x(x), θ) is a lower bound on L(θ) and is a functional of the free distributions q_{x_i}(x_i) and of θ (the dependence on y is left implicit). Here we use q_x(x) to mean the set {q_{x_i}(x_i)}_{i=1}^n. Defining the energy of a global configuration (x, y) to be −ln p(x, y | θ), the lower bound F(q_x(x), θ) ≤ L(θ) is the negative of a quantity known in statistical physics as the free energy: the expected energy under q_x(x) minus the entropy of q_x(x) (Feynman, 1972; Neal and Hinton, 1998).
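To make the bound (2.14)–(2.16) concrete, here is a small numerical check (not part of the thesis) for a single datum drawn from a two-component Gaussian mixture: for any distribution q over the discrete hidden assignment, F lies below the log likelihood, with equality when q is the exact posterior. All settings are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# A single observed datum and a two-component Gaussian mixture (illustrative values).
y = 1.3
pi = np.array([0.4, 0.6])        # mixing proportions
mu = np.array([-1.0, 2.0])       # component means, unit variances

# Joint p(x=s, y | theta) for the discrete hidden assignment s.
joint = pi * norm.pdf(y, loc=mu, scale=1.0)
log_lik = np.log(joint.sum())    # L(theta) = ln p(y|theta), cf. (2.10)

def lower_bound(q):
    """F(q, theta) = E_q[ln p(x, y|theta)] - E_q[ln q(x)], cf. (2.15)."""
    return np.sum(q * (np.log(joint) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
q_exact = joint / joint.sum()    # exact posterior p(x | y, theta)

print("ln p(y|theta)         :", log_lik)
print("F with arbitrary q    :", lower_bound(q_arbitrary))   # strictly below
print("F with exact posterior:", lower_bound(q_exact))       # equals ln p(y|theta)
```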

2.2.2 EM for unconstrained (exact) optimisation

The Expectation-Maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 1977) alternates between an E step, which infers posterior distributions over hidden variables given a current parameter setting, and an M step, which maximises L(θ) with respect to θ given the statistics gathered from the E step. Such a set of updates can be derived using the lower bound: at each iteration, the E step maximises F(q_x(x), θ) with respect to each of the q_{x_i}(x_i), and the M step does so with respect to θ. Mathematically speaking, using a superscript (t) to denote iteration number, starting from some initial parameters θ^(0), the update equations would be:

E step:  q_{x_i}^(t+1) ← arg max_{q_{x_i}} F(q_x(x), θ^(t)) ,  ∀ i ∈ {1, ..., n} ,   (2.17)
M step:  θ^(t+1) ← arg max_θ F(q_x^(t+1)(x), θ) .   (2.18)

For the E step, it turns out that the maximum over q_{x_i}(x_i) of the bound (2.14) is obtained by setting

q_{x_i}^(t+1)(x_i) = p(x_i | y_i, θ^(t)) ,  ∀ i ,   (2.19)

at which point the bound becomes an equality. This can be proven by direct substitution of (2.19) into (2.14):

F(q_x^(t+1)(x), θ^(t)) = Σ_i ∫ dx_i q_{x_i}^(t+1)(x_i) ln [ p(x_i, y_i | θ^(t)) / q_{x_i}^(t+1)(x_i) ]   (2.20)
= Σ_i ∫ dx_i p(x_i | y_i, θ^(t)) ln [ p(x_i, y_i | θ^(t)) / p(x_i | y_i, θ^(t)) ]   (2.21)
= Σ_i ∫ dx_i p(x_i | y_i, θ^(t)) ln [ p(y_i | θ^(t)) p(x_i | y_i, θ^(t)) / p(x_i | y_i, θ^(t)) ]   (2.22)
= Σ_i ∫ dx_i p(x_i | y_i, θ^(t)) ln p(y_i | θ^(t))   (2.23)
= Σ_i ln p(y_i | θ^(t)) = L(θ^(t)) ,   (2.24)

where the last line follows as ln p(y_i | θ) is not a function of x_i. After this E step the bound is tight. The same result can be obtained by functionally differentiating F(q_x(x), θ) with respect to q_{x_i}(x_i), and setting to zero, subject to the normalisation constraints:

∫ dx_i q_{x_i}(x_i) = 1 ,  ∀ i .   (2.25)

The constraints on each q_{x_i}(x_i) can be implemented using Lagrange multipliers {λ_i}_{i=1}^n, forming the new functional:

F̃(q_x(x), θ) = F(q_x(x), θ) + Σ_i λ_i [ ∫ dx_i q_{x_i}(x_i) − 1 ] .   (2.26)

We then take the functional derivative of this expression with respect to each q_{x_i}(x_i) and equate to zero, obtaining the following:

∂F̃(q_x(x), θ^(t)) / ∂q_{x_i}(x_i) = ln p(x_i, y_i | θ^(t)) − ln q_{x_i}(x_i) − 1 + λ_i = 0   (2.27)
⟹ q_{x_i}^(t+1)(x_i) = exp(−1 + λ_i) p(x_i, y_i | θ^(t))   (2.28)
= p(x_i | y_i, θ^(t)) ,  ∀ i ,   (2.29)

where each λ_i is related to the normalisation constant:

λ_i = 1 − ln ∫ dx_i p(x_i, y_i | θ^(t)) ,  ∀ i .   (2.30)

In the remaining derivations in this thesis we always enforce normalisation constraints using Lagrange multiplier terms, although they may not always be explicitly written.

The M step is achieved by simply setting derivatives of (2.14) with respect to θ to zero, which is the same as optimising the expected energy term in (2.15) since the entropy of the hidden state distribution q_x(x) is not a function of θ:

M step:  θ^(t+1) ← arg max_θ Σ_i ∫ dx_i p(x_i | y_i, θ^(t)) ln p(x_i, y_i | θ) .   (2.31)

Note that the optimisation is over the second θ in the integrand, whilst holding p(x_i | y_i, θ^(t)) fixed. Since F(q_x^(t+1)(x), θ^(t)) = L(θ^(t)) at the beginning of each M step, and since the E step does not change the parameters, the likelihood is guaranteed not to decrease after each combined EM step. This is the well known lower bound interpretation of EM: F(q_x(x), θ) is an auxiliary function which lower bounds L(θ) for any q_x(x), attaining equality after each E step. These steps are shown schematically in figure 2.1.

[Figure 2.1 schematic: the log likelihood ln p(y | θ^(t)) decomposed into the lower bound F plus the KL gap between q_x and the exact posterior, before and after the E and M steps.]

Figure 2.1: The variational interpretation of EM for maximum likelihood learning. In the E step the hidden variable variational posterior is set to the exact posterior p(x_i | y_i, θ^(t)), making the bound tight. In the M step the parameters are set to maximise the lower bound F(q_x^(t+1), θ) while holding the distribution over hidden variables q_x^(t+1)(x) fixed.

Here we have expressed the E step as obtaining the full distribution over the hidden variables for each data point. However we note that, in general, the M step may require only a few statistics of the hidden variables, so only these need be computed in the E step.
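For a mixture of Gaussians the exact E step (2.19) and M step (2.31) take a familiar closed form. The sketch below is an illustration under assumed settings, not code from the thesis: it runs exact EM on synthetic one-dimensional data with two unit-variance components, computing posterior responsibilities in the E step and re-estimating the mixing proportions and means in the M step.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians (assumed ground truth).
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(1.5, 1.0, 250)])

# Initial parameter setting theta^(0) = (pi, mu).
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for t in range(50):
    # E step (2.19): q_i(x_i = s) = p(x_i = s | y_i, theta^(t)).
    log_joint = np.log(pi) - 0.5 * (y[:, None] - mu) ** 2   # up to a constant
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    resp /= resp.sum(axis=1, keepdims=True)

    # M step (2.31): maximise the expected complete-data log likelihood.
    Ns = resp.sum(axis=0)
    pi = Ns / len(y)
    mu = (resp * y[:, None]).sum(axis=0) / Ns

print("mixing proportions:", np.round(pi, 3))
print("component means   :", np.round(mu, 3))
```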

2.2.3 EM with constrained (approximate) optimisation

Unfortunately, in many interesting models the data are explained by multiple interacting hidden variables which can result in intractable posterior distributions (Williams and Hinton, 1991; Neal, 1992; Hinton and Zemel, 1994; Ghahramani and Jordan, 1997; Ghahramani and Hinton, 2000). In the variational approach we can constrain the posterior distributions to be of a particular tractable form, for example factorised over the variables x_i = {x_{ij}}_{j=1}^{|x_i|}. Using calculus of variations we can still optimise F(q_x(x), θ) as a functional of constrained distributions q_{x_i}(x_i). The M step, which optimises θ, is conceptually identical to that described in the previous subsection, except that it is based on sufficient statistics calculated with respect to the constrained posterior q_{x_i}(x_i) instead of the exact posterior.

We can write the lower bound F(q_x(x), θ) as

F(q_x(x), θ) = Σ_i ∫ dx_i q_{x_i}(x_i) ln [ p(x_i, y_i | θ) / q_{x_i}(x_i) ]   (2.32)
= Σ_i ∫ dx_i q_{x_i}(x_i) ln p(y_i | θ) + Σ_i ∫ dx_i q_{x_i}(x_i) ln [ p(x_i | y_i, θ) / q_{x_i}(x_i) ]   (2.33)
= Σ_i ln p(y_i | θ) − Σ_i ∫ dx_i q_{x_i}(x_i) ln [ q_{x_i}(x_i) / p(x_i | y_i, θ) ] .   (2.34)

Thus in the E step, maximising F(q_x(x), θ) with respect to q_{x_i}(x_i) is equivalent to minimising the following quantity

∫ dx_i q_{x_i}(x_i) ln [ q_{x_i}(x_i) / p(x_i | y_i, θ) ] ≡ KL[ q_{x_i}(x_i) ‖ p(x_i | y_i, θ) ]   (2.35)
≥ 0 ,   (2.36)

which is the Kullback-Leibler divergence between the variational distribution q_{x_i}(x_i) and the exact hidden variable posterior p(x_i | y_i, θ). As is shown in figure 2.2, the E step does not generally result in the bound becoming an equality, unless of course the exact posterior lies in the family of constrained posteriors q_{x_i}(x_i).

[Figure 2.2 schematic: after a constrained E step the KL divergence between q_x and the exact posterior is no longer zero, so the lower bound F is no longer tight to the log likelihood.]

Figure 2.2: The variational interpretation of constrained EM for maximum likelihood learning. In the E step the hidden variable variational posterior is set to that which minimises KL[ q_x(x) ‖ p(x | y, θ^(t)) ], subject to q_x(x) lying in the family of constrained distributions. In the M step the parameters are set to maximise the lower bound F(q_x^(t+1), θ) given the current distribution over hidden variables.

The M step looks very similar to (2.31), but is based on the current variational posterior over hidden variables:

M step:  θ^(t+1) ← arg max_θ Σ_i ∫ dx_i q_{x_i}^(t+1)(x_i) ln p(x_i, y_i | θ) .   (2.37)

One can choose q_{x_i}(x_i) to be in a particular parameterised family:

q_{x_i}(x_i) = q_{x_i}(x_i | λ_i) ,   (2.38)

where λ_i = {λ_{i1}, ..., λ_{ir}} are r variational parameters for each datum. If we constrain each q_{x_i}(x_i | λ_i) to have easily computable moments (e.g. a Gaussian), and especially if ln p(x_i | y_i, θ) is polynomial in x_i, then we can compute the KL divergence up to a constant and, more importantly, can take its derivatives with respect to the set of variational parameters λ_i of each q_{x_i}(x_i) distribution to perform the constrained E step. The E step of the variational EM algorithm therefore consists of a sub-loop in which each of the q_{x_i}(x_i | λ_i) is optimised by taking derivatives with respect to each λ_{is}, for s = 1, ..., r.
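As a hedged illustration of this sub-loop (not a construction from the thesis), the sketch below performs a constrained E step for a single datum whose exact hidden-variable posterior is non-Gaussian: q(x_i | λ_i) is restricted to a Gaussian with variational parameters λ_i = (mean, log standard deviation), and the per-datum contribution to F is maximised numerically. The unnormalised target and all settings are invented for the example, and the expectation is approximated on a grid.

```python
import numpy as np
from scipy.optimize import minimize

# Unnormalised ln p(x_i, y_i | theta) for one datum, as a function of the hidden x_i.
# A skewed, non-Gaussian target (invented for illustration).
def log_joint(x):
    return -0.5 * x**2 + 1.2 * np.sin(2.0 * x)

grid = np.linspace(-8, 8, 4001)              # quadrature grid over x_i
dx = grid[1] - grid[0]

def neg_F_i(lam):
    """Negative of F_i(lambda) = E_q[ln p(x, y|theta)] + H[q], cf. (2.14)-(2.15)."""
    mean, log_std = lam
    std = np.exp(log_std)
    q = np.exp(-0.5 * ((grid - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    entropy = 0.5 * np.log(2 * np.pi * np.e * std**2)   # Gaussian entropy, closed form
    return -(np.sum(q * log_joint(grid)) * dx + entropy)

res = minimize(neg_F_i, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mean, log_std = res.x
print(f"variational parameters: mean={mean:.3f}, std={np.exp(log_std):.3f}")
print(f"per-datum lower bound F_i = {-res.fun:.3f}")
```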

The mean field approximation

The mean field approximation is the case in which each q_{x_i}(x_i) is fully factorised over the hidden variables:

q_{x_i}(x_i) = ∏_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) .   (2.39)

In this case the expression for F(q_x(x), θ) given by (2.32) becomes:

F(q_x(x), θ) = Σ_i ∫ dx_i [ ∏_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) ] [ ln p(x_i, y_i | θ) − Σ_{j=1}^{|x_i|} ln q_{x_{ij}}(x_{ij}) ]   (2.40)
= Σ_i [ ∫ dx_i ∏_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) ln p(x_i, y_i | θ) − Σ_{j=1}^{|x_i|} ∫ dx_{ij} q_{x_{ij}}(x_{ij}) ln q_{x_{ij}}(x_{ij}) ] .   (2.41)

Using a Lagrange multiplier to enforce normalisation of each of the approximate posteriors, we take the functional derivative of this form with respect to each q_{x_{ij}}(x_{ij}) and equate to zero, obtaining:

q_{x_{ij}}(x_{ij}) = (1 / Z_{ij}) exp [ ∫ dx_{i/j} ∏_{j'/j} q_{x_{ij'}}(x_{ij'}) ln p(x_i, y_i | θ) ] ,   (2.42)

for each data point i ∈ {1, ..., n}, and each variational factorised component j ∈ {1, ..., |x_i|}. We use the notation dx_{i/j} to denote the element of integration for all items in x_i except x_{ij}, and the notation j'/j to denote a product over all terms excluding j. For the ith datum, it is clear that the update equation (2.42), applied to each hidden variable j in turn, represents a set of coupled equations for the approximate posterior over each hidden variable. These fixed point equations are called mean-field equations by analogy to such methods in statistical physics. Examples of these variational approximations can be found in the following: Ghahramani (1995); Saul et al. (1996); Jaakkola (1997); Ghahramani and Jordan (1997).
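When the log joint is quadratic in the hidden variables the mean-field equations (2.42) have a closed form. The following sketch (an illustrative exercise, not from the thesis) approximates a correlated bivariate Gaussian "posterior" by a fully factorised q(x_1) q(x_2) and iterates the coupled fixed-point updates to convergence; it also shows the familiar effect that the factorised approximation underestimates the marginal variances. All numerical settings are invented.

```python
import numpy as np

# Target distribution over two coupled hidden variables: N(mu, inv(Lam)),
# with precision matrix Lam (values invented for illustration).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.6],
                [1.6, 2.0]])

# Mean-field factors q_j(x_j) = N(m_j, 1/Lam_jj); only the means are coupled.
m = np.zeros(2)
for sweep in range(100):
    # Fixed-point update for q_1: expectation of ln p with respect to q_2, cf. (2.42).
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    # Fixed-point update for q_2: expectation of ln p with respect to q_1.
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

true_cov = np.linalg.inv(Lam)
print("mean-field means     :", np.round(m, 4))                    # converge to mu here
print("mean-field variances :", np.round(1.0 / np.diag(Lam), 4))   # too small
print("true marginal vars   :", np.round(np.diag(true_cov), 4))
```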

EM for maximum a posteriori learning

In MAP learning the parameter optimisation includes prior information about the parameters, p(θ), and the M step seeks to find

θ_MAP ≡ arg max_θ p(θ) p(y | θ) .   (2.43)

In the case of an exact E step, the M step is simply augmented to:

M step:  θ^(t+1) ← arg max_θ [ ln p(θ) + Σ_i ∫ dx_i p(x_i | y_i, θ^(t)) ln p(x_i, y_i | θ) ] .   (2.44)

In the case of a constrained approximate E step, the M step is given by

M step:  θ^(t+1) ← arg max_θ [ ln p(θ) + Σ_i ∫ dx_i q_{x_i}^(t+1)(x_i) ln p(x_i, y_i | θ) ] .   (2.45)

However, as mentioned in section 1.3.1, we reiterate that an undesirable feature of MAP estimation is that it is inherently basis-dependent: it is always possible to find a basis in which any particular θ* is the MAP solution, provided θ* has non-zero prior probability.

2.3 Variational methods for Bayesian learning

In this section we show how to extend the above treatment to use variational methods to approximate the integrals required for Bayesian learning. By treating the parameters as unknown quantities as well as the hidden variables, there are now correlations between the parameters and hidden variables in the posterior. The basic idea in the VB framework is to approximate the distribution over both hidden variables and parameters with a simpler distribution, usually one which assumes that the hidden states and parameters are independent given the data. There are two main goals in Bayesian learning. The first is approximating the marginal likelihood p(y | m) in order to perform model comparison. The second is approximating the posterior distribution over the parameters of a model, p(θ | y, m), which can then be used for prediction.

2.3.1 Deriving the learning rules

As before, let y denote the observed variables, x denote the hidden variables, and θ denote the parameters. We assume a prior distribution over parameters p(θ | m) conditional on the model m. The marginal likelihood of a model, p(y | m), can be lower bounded by introducing any distribution over both latent variables and parameters which has support where p(x, θ | y, m) does, by appealing to Jensen's inequality once more:

ln p(y | m) = ln ∫ dθ ∫ dx p(x, y, θ | m)   (2.46)
= ln ∫ dθ ∫ dx q(x, θ) [ p(x, y, θ | m) / q(x, θ) ]   (2.47)
≥ ∫ dθ ∫ dx q(x, θ) ln [ p(x, y, θ | m) / q(x, θ) ] .   (2.48)

Maximising this lower bound with respect to the free distribution q(x, θ) results in q(x, θ) = p(x, θ | y, m), which when substituted above turns the inequality into an equality (in exact analogy with (2.19)). This does not simplify the problem since evaluating the exact posterior distribution p(x, θ | y, m) requires knowing its normalising constant, the marginal likelihood. Instead we constrain the posterior to be a simpler, factorised (separable) approximation q(x, θ) ≈ q_x(x) q_θ(θ):

ln p(y | m) ≥ ∫ dθ ∫ dx q_x(x) q_θ(θ) ln [ p(x, y, θ | m) / ( q_x(x) q_θ(θ) ) ]   (2.49)
= ∫ dθ q_θ(θ) [ ∫ dx q_x(x) ln [ p(x, y | θ, m) / q_x(x) ] + ln [ p(θ | m) / q_θ(θ) ] ]   (2.50)
= F_m(q_x(x), q_θ(θ))   (2.51)
= F_m(q_{x_1}(x_1), ..., q_{x_n}(x_n), q_θ(θ)) ,   (2.52)

where the last equality is a consequence of the data y arriving i.i.d. (this is shown in theorem 2.1 below). The quantity F_m is a functional of the free distributions, q_x(x) and q_θ(θ).

The variational Bayesian algorithm iteratively maximises F_m in (2.51) with respect to the free distributions, q_x(x) and q_θ(θ), which is essentially coordinate ascent in the function space of variational distributions. The following very general theorem provides the update equations for variational Bayesian learning.

Theorem 2.1: Variational Bayesian EM (VBEM). Let m be a model with parameters θ giving rise to an i.i.d. data set y = {y_1, ..., y_n} with corresponding hidden variables x = {x_1, ..., x_n}. A lower bound on the model log marginal likelihood is

F_m(q_x(x), q_θ(θ)) = ∫ dθ ∫ dx q_x(x) q_θ(θ) ln [ p(x, y, θ | m) / ( q_x(x) q_θ(θ) ) ]   (2.53)

and this can be iteratively optimised by performing the following updates, using superscript (t) to denote iteration number:

VBE step:  q_{x_i}^(t+1)(x_i) = (1 / Z_{x_i}) exp [ ∫ dθ q_θ^(t)(θ) ln p(x_i, y_i | θ, m) ]   (2.54)

where

q_x^(t+1)(x) = ∏_{i=1}^n q_{x_i}^(t+1)(x_i) ,   (2.55)

and

VBM step:  q_θ^(t+1)(θ) = (1 / Z_θ) p(θ | m) exp [ ∫ dx q_x^(t+1)(x) ln p(x, y | θ, m) ] .   (2.56)

Moreover, the update rules converge to a local maximum of F_m(q_x(x), q_θ(θ)).

Proof of q_{x_i}(x_i) update: using variational calculus.

Take functional derivatives of F_m(q_x(x), q_θ(θ)) with respect to q_x(x), and equate to zero:

∂F_m(q_x(x), q_θ(θ)) / ∂q_x(x) = ∫ dθ q_θ(θ) [ ∂/∂q_x(x) ∫ dx q_x(x) ln [ p(x, y | θ, m) / q_x(x) ] ]   (2.57)
= ∫ dθ q_θ(θ) [ ln p(x, y | θ, m) − ln q_x(x) − 1 ]   (2.58)
= 0 ,   (2.59)

which implies

ln q_x^(t+1)(x) = ∫ dθ q_θ^(t)(θ) ln p(x, y | θ, m) − ln Z_x^(t+1) ,   (2.60)

where Z_x is a normalisation constant (from a Lagrange multiplier term enforcing normalisation of q_x(x), omitted for brevity). As a consequence of the i.i.d. assumption, this update can be broken down across the n data points

ln q_x^(t+1)(x) = ∫ dθ q_θ^(t)(θ) Σ_{i=1}^n ln p(x_i, y_i | θ, m) − ln Z_x^(t+1) ,   (2.61)

which implies that the optimal q_x^(t+1)(x) is factorised in the form q_x^(t+1)(x) = ∏_{i=1}^n q_{x_i}^(t+1)(x_i), with

ln q_{x_i}^(t+1)(x_i) = ∫ dθ q_θ^(t)(θ) ln p(x_i, y_i | θ, m) − ln Z_{x_i}^(t+1) ,   (2.62)
with Z_x = ∏_{i=1}^n Z_{x_i} .   (2.63)

Thus for a given q_θ(θ), there is a unique stationary point for each q_{x_i}(x_i).

Proof of q_θ(θ) update: using variational calculus.

[Figure 2.3 schematic: the fixed log marginal likelihood ln p(y | m) decomposed into the lower bound F plus the KL divergence between q_x q_θ and the exact joint posterior, before and after the VBE and VBM steps.]

Figure 2.3: The variational Bayesian EM (VBEM) algorithm. In the VBE step, the variational posterior over hidden variables q_x(x) is set according to (2.60). In the VBM step, the variational posterior over parameters is set according to (2.56). Each step is guaranteed to increase (or leave unchanged) the lower bound on the marginal likelihood. (Note that the exact log marginal likelihood is a fixed quantity, and does not change with VBE or VBM steps; it is only the lower bound which increases.)

Proceeding as above, take functional derivatives of F_m(q_x(x), q_θ(θ)) with respect to q_θ(θ) and equate to zero, yielding:

∂F_m(q_x(x), q_θ(θ)) / ∂q_θ(θ) = ∂/∂q_θ(θ) ∫ dθ q_θ(θ) [ ∫ dx q_x(x) ln p(x, y | θ, m) + ln [ p(θ | m) / q_θ(θ) ] ]   (2.64, 2.65)
= ∫ dx q_x(x) ln p(x, y | θ) + ln p(θ | m) − ln q_θ(θ) + c   (2.66)
= 0 ,   (2.67)

which upon rearrangement produces

ln q_θ^(t+1)(θ) = ln p(θ | m) + ∫ dx q_x^(t+1)(x) ln p(x, y | θ) − ln Z_θ^(t+1) ,   (2.68)

where Z_θ is the normalisation constant (related to the Lagrange multiplier which has again been omitted for succinctness). Thus for a given q_x(x), there is a unique stationary point for q_θ(θ).

At this point it is well worth noting the symmetry between the hidden variables and the parameters. The individual VBE steps can be written as one batch VBE step:

q_x^(t+1)(x) = (1 / Z_x) exp [ ∫ dθ q_θ^(t)(θ) ln p(x, y | θ, m) ]   (2.69)
with Z_x = ∏_{i=1}^n Z_{x_i} .   (2.70)

On the surface, it seems that the variational update rules (2.60) and (2.56) differ only in the prior term p(θ | m) over the parameters. There actually also exists a prior term over the hidden variables as part of p(x, y | θ, m), so this does not resolve the two. The distinguishing feature between hidden variables and parameters is that the number of hidden variables increases with data set size, whereas the number of parameters is assumed fixed.

Re-writing (2.53), it is easy to see that maximising F_m(q_x(x), q_θ(θ)) is simply equivalent to minimising the KL divergence between q_x(x) q_θ(θ) and the joint posterior over hidden states and parameters p(x, θ | y, m):

ln p(y | m) − F_m(q_x(x), q_θ(θ)) = ∫ dθ ∫ dx q_x(x) q_θ(θ) ln [ q_x(x) q_θ(θ) / p(x, θ | y, m) ]   (2.71)
= KL[ q_x(x) q_θ(θ) ‖ p(x, θ | y, m) ]   (2.72)
≥ 0 .   (2.73)

Note the similarity between expressions (2.35) and (2.72): while we minimise the former with respect to hidden variable distributions and the parameters, the latter we minimise with respect to the hidden variable distribution and a distribution over parameters.

The variational Bayesian EM algorithm reduces to the ordinary EM algorithm for ML estimation if we restrict the parameter distribution to a point estimate, i.e. a Dirac delta function, q_θ(θ) = δ(θ − θ*), in which case the M step simply involves re-estimating θ*. Note that the same cannot be said in the case of MAP estimation, which is inherently basis dependent, unlike both VB and ML algorithms. By construction, the VBEM algorithm is guaranteed to monotonically increase an objective function F, as a function of a distribution over parameters and hidden variables. Since we integrate over model parameters there is a naturally incorporated model complexity penalty. It turns out that for a large class of models (see section 2.4) the VBE step has approximately the same computational complexity as the standard E step in the ML framework, which makes it viable as a Bayesian replacement for the EM algorithm.
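To ground theorem 2.1, the following sketch (an illustrative implementation under assumed priors, not code from the thesis) runs VBEM on a two-component mixture of unit-variance Gaussians, with a Dirichlet prior on the mixing proportions and Gaussian priors on the component means. The VBE step uses expectations under q_θ(θ), and the VBM step returns conjugate posteriors, so both updates stay in closed form; all prior settings and data are invented for the example.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 120), rng.normal(2.0, 1.0, 180)])
n, S = len(y), 2

# Priors (assumed for the example): Dirichlet(alpha0) on pi, N(m0, 1/beta0) on each mean.
alpha0, m0, beta0 = 1.0, 0.0, 1e-2

# Initialise the variational posterior over parameters q_theta.
alpha = np.full(S, alpha0) + rng.random(S)
m = np.array([-1.0, 1.0])       # posterior means of the component means
beta = np.full(S, 1.0)          # posterior precisions of the component means

for t in range(100):
    # VBE step (2.54): responsibilities from expectations under q_theta.
    # E[ln pi_s] = digamma(alpha_s) - digamma(sum alpha);
    # E[(y - mu_s)^2] = (y - m_s)^2 + 1/beta_s   (unit likelihood variance assumed).
    log_rho = (digamma(alpha) - digamma(alpha.sum())
               - 0.5 * ((y[:, None] - m) ** 2 + 1.0 / beta))
    log_rho -= log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    r /= r.sum(axis=1, keepdims=True)

    # VBM step (2.56): conjugate updates of q_theta given expected sufficient statistics.
    Ns = r.sum(axis=0)
    alpha = alpha0 + Ns
    beta = beta0 + Ns
    m = (beta0 * m0 + (r * y[:, None]).sum(axis=0)) / beta

print("E[pi] under q_theta:", np.round(alpha / alpha.sum(), 3))
print("E[mu] under q_theta:", np.round(m, 3))
```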

2.3.2 Discussion

The impact of the q(x, θ) ≈ q_x(x) q_θ(θ) factorisation

Unless we make the assumption that the posterior over parameters and hidden variables factorises, we will not generally obtain the further hidden variable factorisation over n that we have in equation (2.55). In that case, the distributions of x_i and x_j will be coupled for all cases {i, j} in the data set, greatly increasing the overall computational complexity of inference. This further factorisation is depicted in figure 2.4 for the case of n = 3, where we see: (a) the original directed graphical model, where θ is the collection of parameters governing prior distributions over the hidden variables x_i and the conditional probability p(y_i | x_i, θ); (b) the moralised graph given the data {y_1, y_2, y_3}, which shows that the hidden variables are now dependent in the posterior through the uncertain parameters; (c) the effective graph after the factorisation assumption, which not only removes arcs between the parameters and hidden variables, but also removes the dependencies between the hidden variables. This latter independence falls out from the optimisation as a result of the i.i.d. nature of the data, and is not a further approximation.

Whilst this factorisation of the posterior distribution over hidden variables and parameters may seem drastic, one can think of it as replacing stochastic dependencies between x and θ with deterministic dependencies between relevant moments of the two sets of variables. The advantage of ignoring how fluctuations in x induce fluctuations in θ (and vice-versa) is that we can obtain analytical approximations to the log marginal likelihood. It is these same ideas that underlie mean-field approximations from statistical physics, from where these lower-bounding variational approximations were inspired (Feynman, 1972; Parisi, 1988). In later chapters the consequences of the factorisation for particular models are studied in some detail; in particular we will use sampling methods to estimate by how much the variational bound falls short of the marginal likelihood.

What forms for q_x(x) and q_θ(θ)?

One might need to approximate the posterior further than simply the hidden-variable / parameter factorisation. A common reason for this is that the parameter posterior may still be intractable despite the hidden-variable / parameter factorisation. The free-form extremisation of F normally provides us with a functional form for q_θ(θ), but this may be unwieldy; we therefore need to assume some simpler space of parameter posteriors. The most commonly used distributions are those with just a few sufficient statistics, such as the Gaussian or Dirichlet distributions. Taking a Gaussian example, F is then explicitly extremised with respect to a set of variational parameters ζ_θ = (µ_θ, ν_θ) which parameterise the Gaussian q_θ(θ | ζ_θ). We will see examples of this approach in later chapters. There may also exist intractabilities in the hidden variable posterior, for which further approximations need be made (some examples are mentioned below).

[Figure 2.4: three-panel graphical model diagram for n = 3, with hidden variables x_1, x_2, x_3, observations y_1, y_2, y_3, and parameters θ. Panels: (a) The generative graphical model. (b) Graph representing the exact posterior. (c) Posterior graph after the variational approximation.]

Figure 2.4: Graphical depiction of the hidden-variable / parameter factorisation. (a) The original generative model for n = 3. (b) The exact posterior graph given the data. Note that for all case pairs {i, j}, x_i and x_j are not directly coupled, but interact through θ. That is to say all the hidden variables are conditionally independent of one another, but only given the parameters. (c) The posterior graph after the variational approximation between parameters and hidden variables, which removes arcs between parameters and hidden variables. Note that, on assuming this factorisation, as a consequence of the i.i.d. assumption the hidden variables become independent.

There is something of a dark art in discovering a factorisation amongst the hidden variables and parameters such that the approximation remains faithful at an acceptable level. Of course it does not make sense to use a posterior form which holds fewer conditional independencies than those implied by the moral graph (see section 1.1). The key to a good variational approximation is then to remove as few arcs as possible from the moral graph such that inference becomes tractable. In many cases the goal is to find tractable substructures (structured approximations) such as trees or mixtures of trees, which capture as many of the arcs as possible. Some arcs may capture crucial dependencies between nodes and so need be kept, whereas other arcs might induce a weak local correlation at the expense of a long-range correlation which to first order can be ignored; removing such an arc can have dramatic effects on the tractability.

The advantage of the variational Bayesian procedure is that any factorisation of the posterior yields a lower bound on the marginal likelihood. Thus in practice it may pay to approximately evaluate the computational cost of several candidate factorisations, and implement those which can return a completed optimisation of F within a certain amount of computer time. One would expect the more complex factorisations to take more computer time but also yield progressively tighter lower bounds on average, the consequence being that the marginal likelihood estimate improves over time. An interesting avenue of research in this vein would be to use the variational posterior resulting from a simpler factorisation as the initialisation for a slightly more complicated factorisation, and move in a chain from simple to complicated factorisations to help avoid local free energy minima in the optimisation. Having proposed this, it remains to be seen if it is possible to form a coherent closely-spaced chain of distributions that are of any use, as compared to starting from the fullest posterior approximation from the start.

Using the lower bound for model selection and averaging

The log ratio of posterior probabilities of two competing models m and m' is given by

ln [ p(m | y) / p(m' | y) ] = ln p(m) + ln p(y | m) − ln p(m') − ln p(y | m')   (2.74)
= ln p(m) + F(q_x, q_θ) + KL[ q(x, θ) ‖ p(x, θ | y, m) ] − ln p(m') − F'(q'_x, q'_θ) − KL[ q'(x, θ) ‖ p(x, θ | y, m') ] ,   (2.75)

where we have used the form in (2.72), which is exact regardless of the quality of the bound used, or how tightly that bound has been optimised. The lower bounds for the two models, F and F', are calculated from VBEM optimisations, providing us for each model with an approximation to the posterior over the hidden variables and parameters of that model, q_{x,θ} and q'_{x,θ}; these may in general be functionally very different (we leave aside for the moment local maxima problems in the optimisation process, which can be overcome to an extent by using several differently initialised optimisations or in some models by employing heuristics tailored to exploit the model structure).

When we perform model selection by comparing the lower bounds, F and F', we are assuming that the KL divergences in the two approximations are the same, so that we can use just these lower bounds as a guide. Unfortunately it is non-trivial to predict how tight in theory any particular bound can be; if this were possible we could more accurately estimate the marginal likelihood from the start.

Taking an example, we would like to know whether the bound for a model with S mixture components is similar to that for S + 1 components, and if not then how badly this inconsistency affects the posterior over this set of models. Roughly speaking, let us assume that every component in our model contributes a (constant) KL divergence penalty of KL_s. For clarity we use the notation L(S) and F(S) to denote the exact log marginal likelihood and lower bounds, respectively, for a model with S components. The difference in log marginal likelihoods, L(S + 1) − L(S), is the quantity we wish to estimate, but if we base this on the lower bounds the difference becomes

L(S + 1) − L(S) = [ F(S + 1) + (S + 1) KL_s ] − [ F(S) + S KL_s ]   (2.76)
= F(S + 1) − F(S) + KL_s   (2.77)
≥ F(S + 1) − F(S) ,   (2.78)

where the last line is the result we would have basing the difference on lower bounds. Therefore there exists a systematic error when comparing models if each component contributes independently to the KL divergence term. Since the KL divergence is strictly positive, and we are basing our model selection on (2.78) rather than (2.77), this analysis suggests that there is a systematic bias towards simpler models. We will in fact see this in chapter 4, where we find an importance sampling estimate of the KL divergence showing this behaviour.

Optimising the prior distributions

Usually the parameter priors are functions of hyperparameters, a, so we can write p(θ | a, m). In the variational Bayesian framework the lower bound can be made higher by maximising F_m with respect to these hyperparameters:

a^(t+1) = arg max_a F_m(q_x(x), q_θ(θ), y, a) .   (2.79)

A simple depiction of this optimisation is given in figure 2.5. Unlike earlier in section 2.3.1, the marginal likelihood of model m can now be increased with hyperparameter optimisation. As we will see in later chapters, there are examples where these hyperparameters themselves have governing hyperpriors, such that they can be integrated over as well. The result being that we can infer distributions over these as well, just as for the parameters.

[Figure 2.5 schematic: the lower bound F and the log marginal likelihood ln p(y | a, m) before a VBEM step, after a VBEM step, and after a hyperparameter optimisation step; the hyperparameter update raises both the bound and the marginal likelihood itself.]

Figure 2.5: The variational Bayesian EM algorithm with hyperparameter optimisation. The VBEM step consists of VBE and VBM steps, as shown in figure 2.3. The hyperparameter optimisation increases the lower bound and also improves the marginal likelihood.

The reason for abstracting from the parameters this far is that we would like to integrate out all variables whose cardinality increases with model complexity; this standpoint will be made clearer in the following chapters.

Previous work, and general applicability of VBEM

The variational approach for lower bounding the marginal likelihood (and similar quantities) has been explored by several researchers in the past decade, and has received a lot of attention recently in the machine learning community. It was first proposed for one-hidden layer neural networks (which have no hidden variables) by Hinton and van Camp (1993), where q_θ(θ) was restricted to be Gaussian with diagonal covariance. This work was later extended to show that tractable approximations were also possible with a full covariance Gaussian (Barber and Bishop, 1998) (which in general will have the mode of the posterior at a different location than in the diagonal case). Neal and Hinton (1998) presented a generalisation of EM which made use of Jensen's inequality to allow partial E-steps; in this paper the term ensemble learning was used to describe the method since it fits an ensemble of models, each with its own parameters. Jaakkola (1997) and Jordan et al. (1999) review variational methods in a general context (i.e. non-Bayesian). Variational Bayesian methods have been applied to various models with hidden variables and no restrictions on q_θ(θ) and q_{x_i}(x_i) other than the assumption that they factorise in some way (Waterhouse et al., 1996; Bishop, 1999; Ghahramani and Beal, 2000; Attias, 2000). Of particular note is the variational Bayesian HMM of MacKay (1997), in which free-form optimisations are explicitly undertaken (see chapter 3); this work was the inspiration for the examination of Conjugate-Exponential (CE) models, discussed in the next section. An example of a constrained optimisation for a logistic regression model can be found in Jaakkola and Jordan (2000).

Several researchers have investigated using mixture distributions for the approximate posterior, which allows for more flexibility whilst maintaining a degree of tractability (Lawrence et al., 1998; Bishop et al., 1998; Lawrence and Azzouzi, 1999). The lower bound in these models is a sum of two terms: a first term which is a convex combination of bounds from each mixture component, and a second term which is the mutual information between the mixture labels and the hidden variables of the model. The first term offers no improvement over a naive combination of bounds, but the second (which is non-negative) has to improve on the simple bounds. Unfortunately this term contains an expectation over all configurations of the hidden states and so has to be itself bounded with a further use of Jensen's inequality in the form of a convex bound on the log function (ln(x) ≤ λx − ln(λ) − 1) (Jaakkola and Jordan, 1998). Despite this approximation drawback, empirical results in a handful of models have shown that the approximation does improve the simple mean field bound and improves monotonically with the number of mixture components.

A related method for approximating the integrand for Bayesian learning is based on an idea known as assumed density filtering (ADF) (Bernardo and Giron, 1988; Stephens, 1997; Boyen and Koller, 1998; Barber and Sollich, 2000; Frey et al., 2001), and is called the Expectation Propagation (EP) algorithm (Minka, 2001a). This algorithm approximates the integrand of interest with a set of terms, and through a process of repeated deletion-inclusion of term expressions, the integrand is iteratively refined to resemble the true integrand as closely as possible. Therefore the key to the method is to use terms which can be tractably integrated. This has the same flavour as the variational Bayesian method described here, where we iteratively update the approximate posterior over a hidden state q_{x_i}(x_i) or over the parameters q_θ(θ). The key difference between EP and VB is that in the update process (i.e. deletion-inclusion) EP seeks to minimise the KL divergence which averages according to the true distribution, KL[ p(x, θ | y) ‖ q(x, θ) ] (which is simply a moment-matching operation for exponential family models), whereas VB seeks to minimise the KL divergence according to the approximate distribution, KL[ q(x, θ) ‖ p(x, θ | y) ]. Therefore, EP is at least attempting to average according to the correct distribution, whereas VB has the wrong cost function at heart. However, in general the KL divergence in EP can only be minimised separately one term at a time, while the KL divergence in VB is minimised globally over all terms in the approximation. The result is that EP may still not result in representative posterior distributions (for example, see Minka, 2001a, figure 3.6, p. 6). Having said that, it may be that more generalised deletion-inclusion steps can be derived for EP, for example removing two or more terms at a time from the integrand, and this may alleviate some of the local restrictions of the EP algorithm. As in VB, EP is constrained to use particular parametric families with a small number of moments for tractability. An example of EP used with an assumed Dirichlet density for the term expressions can be found in Minka and Lafferty (2002).
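The difference in KL direction between VB and EP can be visualised numerically. The sketch below (a toy comparison, not from the thesis or from Minka's papers) fits a single Gaussian q to a bimodal mixture p by minimising each divergence on a grid: minimising KL[q ‖ p] (the VB direction) locks onto one mode, whereas minimising KL[p ‖ q] (the direction EP works with) matches the overall moments and spreads across both modes. All settings are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal "true" distribution p(x): a two-component Gaussian mixture (illustrative).
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]
p = 0.5 * norm.pdf(grid, -2.5, 0.7) + 0.5 * norm.pdf(grid, 2.5, 0.7)
p /= p.sum() * dx

def kl(a, b):
    """Grid estimate of KL[a || b] for densities a, b evaluated on the same grid."""
    mask = a > 1e-12
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(np.maximum(b[mask], 1e-300)))) * dx

def gaussian(params):
    mean, log_std = params
    return norm.pdf(grid, mean, np.exp(log_std))

# VB-style objective: KL[q || p]  (mode-seeking behaviour).
res_vb = minimize(lambda t: kl(gaussian(t), p), x0=[-1.0, 0.0], method="Nelder-Mead")
# EP-style objective: KL[p || q]  (moment-matching behaviour).
res_ep = minimize(lambda t: kl(p, gaussian(t)), x0=[-1.0, 0.0], method="Nelder-Mead")

print("argmin KL[q||p]: mean=%.2f std=%.2f" % (res_vb.x[0], np.exp(res_vb.x[1])))
print("argmin KL[p||q]: mean=%.2f std=%.2f" % (res_ep.x[0], np.exp(res_ep.x[1])))
```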

In the next section we take a closer look at the variational Bayesian EM equations, (2.54) and (2.56), and ask the following questions:

- To which models can we apply VBEM? I.e. which forms of data distributions p(y_i, x_i | θ) and priors p(θ | m) result in tractable VBEM updates?
- How does this relate formally to conventional EM?
- When can we utilise existing belief propagation algorithms in the VB framework?

2.4 Conjugate-Exponential models

2.4.1 Definition

We consider a particular class of graphical models with latent variables, which we call conjugate-exponential (CE) models. In this section we explicitly apply the variational Bayesian method to these parametric families, deriving a simple general form of VBEM for the class.

Conjugate-exponential models satisfy two conditions:

Condition (1). The complete-data likelihood is in the exponential family:

p(x_i, y_i | θ) = g(θ) f(x_i, y_i) e^{φ(θ)ᵀ u(x_i, y_i)} ,   (2.80)

where φ(θ) is the vector of natural parameters, u and f are the functions that define the exponential family, and g is a normalisation constant:

g(θ)^{-1} = ∫ dx_i dy_i f(x_i, y_i) e^{φ(θ)ᵀ u(x_i, y_i)} .   (2.81)

The natural parameters for an exponential family model φ are those that interact linearly with the sufficient statistics of the data u. For example, for a univariate Gaussian in x with mean µ and standard deviation σ, the necessary quantities are obtained from:

p(x | µ, σ) = exp { −x²/(2σ²) + xµ/σ² − µ²/(2σ²) − (1/2) ln(2πσ²) }   (2.82)
θ = (σ², µ)   (2.83)

and are given by:

φ(θ) = (1/σ², µ/σ²)   (2.84)
u(x) = (−x²/2, x)   (2.85)
f(x) = 1   (2.86)
g(θ) = exp { −µ²/(2σ²) − (1/2) ln(2πσ²) } .   (2.87)

Note that whilst the parameterisation for θ is arbitrary, e.g. we could have let θ = (σ, µ), the natural parameters φ are unique up to a multiplicative constant.

Condition (2). The parameter prior is conjugate to the complete-data likelihood:

p(θ | η, ν) = h(η, ν) g(θ)^η e^{φ(θ)ᵀ ν} ,   (2.88)

where η and ν are hyperparameters of the prior, and h is a normalisation constant:

h(η, ν)^{-1} = ∫ dθ g(θ)^η e^{φ(θ)ᵀ ν} .   (2.89)

Condition 1 (2.80) in fact usually implies the existence of a conjugate prior which satisfies condition 2 (2.88). The prior p(θ | η, ν) is said to be conjugate to the likelihood p(x_i, y_i | θ) if and only if the posterior

p(θ | η', ν') ∝ p(θ | η, ν) p(x_i, y_i | θ)   (2.90)

is of the same parametric form as the prior. In general the exponential families are the only classes of distributions that have natural conjugate prior distributions, because they are the only distributions with a fixed number of sufficient statistics apart from some irregular cases (see Gelman et al., 1995, p. 38). From the definition of conjugacy, we see that the hyperparameters of a conjugate prior can be interpreted as the number (η) and values (ν) of pseudo-observations under the corresponding likelihood.

We call models that satisfy conditions 1 (2.80) and 2 (2.88) conjugate-exponential. The list of latent-variable models of practical interest with complete-data likelihoods in the exponential family is very long, for example: Gaussian mixtures, factor analysis, principal components analysis, hidden Markov models and extensions, switching state-space models, discrete-variable belief networks. Of course there are also many as yet undreamt-of models combining Gaussian, gamma, Poisson, Dirichlet, Wishart, multinomial, and other distributions in the exponential family.
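As a quick check of the exponential-family bookkeeping in (2.80)–(2.87), the sketch below (illustrative only, with an arbitrary parameter setting) evaluates the univariate Gaussian density through g(θ) f(x) exp{φ(θ)ᵀ u(x)} and compares it with a direct evaluation of the same density.

```python
import numpy as np
from scipy.stats import norm

def phi(mu, sigma):
    """Natural parameters, cf. (2.84)."""
    return np.array([1.0 / sigma**2, mu / sigma**2])

def u(x):
    """Sufficient statistics, cf. (2.85)."""
    return np.array([-x**2 / 2.0, x])

def g(mu, sigma):
    """Normalisation constant, cf. (2.87)."""
    return np.exp(-mu**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2))

mu, sigma = 0.7, 1.3   # arbitrary illustrative setting
for x in [-1.0, 0.0, 2.5]:
    p_exp_family = g(mu, sigma) * 1.0 * np.exp(phi(mu, sigma) @ u(x))   # f(x) = 1
    print(x, p_exp_family, norm.pdf(x, mu, sigma))   # the two columns agree
```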

However there are some notable outcasts which do not satisfy the conditions for membership of the CE family, namely: Boltzmann machines (Ackley et al., 1985), logistic regression and sigmoid belief networks (Bishop, 1995), and independent components analysis (ICA) (as presented in Comon, 1994; Bell and Sejnowski, 1995), all of which are widely used in the machine learning community. As an example let us see why logistic regression is not in the conjugate-exponential family: for y_i ∈ {−1, 1}, the likelihood under a logistic regression model is

p(y_i | x_i, θ) = e^{y_i θᵀ x_i} / ( e^{θᵀ x_i} + e^{−θᵀ x_i} ) ,   (2.91)

where x_i is the regressor for data point i and θ is a vector of weights, potentially including a bias. This can be rewritten as

p(y_i | x_i, θ) = e^{y_i θᵀ x_i − f(θ, x_i)} ,   (2.92)

where f(θ, x_i) is a normalisation constant. To belong in the exponential family the normalising constant must split into functions of only θ and only (x_i, y_i). Expanding f(θ, x_i) yields a series of powers of θᵀ x_i, which could be assimilated into the φ(θ)ᵀ u(x_i, y_i) term by augmenting the natural parameter and sufficient statistics vectors, if it were not for the fact that the series is infinite, meaning that there would need to be an infinity of natural parameters. This means we cannot represent the likelihood with a finite number of sufficient statistics.

Models whose complete-data likelihood is not in the exponential family can often be approximated by models which are in the exponential family and have been given additional hidden variables. A very good example is the Independent Factor Analysis (IFA) model of Attias (1999a). In conventional ICA, one can think of the model as using non-Gaussian sources, or using Gaussian sources passed through a non-linearity to make them non-Gaussian. For most non-linearities commonly used (such as the logistic), the complete-data likelihood becomes non-CE. Attias recasts the model as a mixture of Gaussian sources being fed into a linear mixing matrix. This model is in the CE family and so can be tackled with the VB treatment. It is an open area of research to investigate how best to bring models into the CE family, such that inferences in the modified model resemble the original as closely as possible.

2.4.2 Variational Bayesian EM for CE models

In Bayesian inference we want to determine the posterior over parameters and hidden variables p(x, θ | y, η, ν). In general this posterior is neither conjugate nor in the exponential family. In this subsection we see how the properties of the CE family make it especially amenable to the VB approximation, and derive the VBEM algorithm for CE models.

Theorem 2.2: Variational Bayesian EM for Conjugate-Exponential Models. Given an i.i.d. data set y = {y_1, ..., y_n}, if the model satisfies conditions (1) and (2), then the following (a), (b) and (c) hold:

(a) the VBE step yields:

q_x(x) = ∏_{i=1}^n q_{x_i}(x_i) ,   (2.93)

and q_{x_i}(x_i) is in the exponential family:

q_{x_i}(x_i) ∝ f(x_i, y_i) e^{φ̄ᵀ u(x_i, y_i)} = p(x_i | y_i, φ̄) ,   (2.94)

with a natural parameter vector

φ̄ = ∫ dθ q_θ(θ) φ(θ) ≡ ⟨φ(θ)⟩_{q_θ(θ)}   (2.95)

obtained by taking the expectation of φ(θ) under q_θ(θ) (denoted using angle-brackets ⟨·⟩). For invertible φ, defining θ̃ such that φ(θ̃) = φ̄, we can rewrite the approximate posterior as

q_{x_i}(x_i) = p(x_i | y_i, θ̃) .   (2.96)

(b) the VBM step yields that q_θ(θ) is conjugate and of the form:

q_θ(θ) = h(η̃, ν̃) g(θ)^{η̃} e^{φ(θ)ᵀ ν̃} ,   (2.97)

where

η̃ = η + n ,   (2.98)
ν̃ = ν + Σ_{i=1}^n ū(y_i) ,   (2.99)

and

ū(y_i) = ⟨u(x_i, y_i)⟩_{q_{x_i}(x_i)}   (2.100)

is the expectation of the sufficient statistic u. We have used ⟨·⟩_{q_{x_i}(x_i)} to denote expectation under the variational posterior over the latent variable(s) associated with the ith datum.

(c) parts (a) and (b) hold for every iteration of variational Bayesian EM.

Proof of (a): by direct substitution.

Starting from the variational extrema solution (2.60) for the VBE step:

q_x(x) = (1 / Z_x) e^{⟨ln p(x, y | θ, m)⟩_{q_θ(θ)}} ,   (2.101)

substitute the parametric form for p(x, y | θ, m) in condition 1 (2.80), which yields (omitting iteration superscripts):

q_x(x) = (1 / Z_x) e^{⟨ Σ_{i=1}^n ln g(θ) + ln f(x_i, y_i) + φ(θ)ᵀ u(x_i, y_i) ⟩_{q_θ(θ)}}   (2.102)
= (1 / Z̃_x) [ ∏_{i=1}^n f(x_i, y_i) ] e^{ Σ_{i=1}^n φ̄ᵀ u(x_i, y_i) } ,   (2.103)

where Z̃_x has absorbed constants independent of x, and we have defined without loss of generality:

φ̄ = ⟨φ(θ)⟩_{q_θ(θ)} .   (2.104)

If φ is invertible, then there exists a θ̃ such that φ̄ = φ(θ̃), and we can rewrite (2.103) as:

q_x(x) = (1 / Z̃_x) ∏_{i=1}^n f(x_i, y_i) e^{φ(θ̃)ᵀ u(x_i, y_i)}   (2.105)
∝ ∏_{i=1}^n p(x_i, y_i | θ̃, m)   (2.106)
= ∏_{i=1}^n q_{x_i}(x_i)   (2.107)
= p(x | y, θ̃, m) .   (2.108)

Thus the result of the approximate VBE step, which averages over the ensemble of models q_θ(θ), is exactly the same as an exact E step, calculated at the variational Bayes point estimate θ̃.

Proof of (b): by direct substitution.

Starting from the variational extrema solution (2.56) for the VBM step:

q_θ(θ) = (1 / Z_θ) p(θ | m) e^{⟨ln p(x, y | θ, m)⟩_{q_x(x)}} ,   (2.109)

substitute the parametric forms for p(θ | m) and p(x, y | θ, m) as specified in conditions 2 (2.88) and 1 (2.80) respectively, which yields (omitting iteration superscripts):

q_θ(θ) = (1 / Z_θ) h(η, ν) g(θ)^η e^{φ(θ)ᵀ ν} e^{⟨ Σ_{i=1}^n ln g(θ) + ln f(x_i, y_i) + φ(θ)ᵀ u(x_i, y_i) ⟩_{q_x(x)}}   (2.110)
= (1 / Z_θ) h(η, ν) g(θ)^{η + n} e^{φ(θ)ᵀ [ ν + Σ_{i=1}^n ū(y_i) ]} e^{ Σ_{i=1}^n ⟨ln f(x_i, y_i)⟩_{q_x(x)}}   (the last factor has no θ dependence)   (2.111)
= h(η̃, ν̃) g(θ)^{η̃} e^{φ(θ)ᵀ ν̃} ,   (2.112)

where

h(η̃, ν̃) = (1 / Z_θ) h(η, ν) ∏_{i=1}^n e^{⟨ln f(x_i, y_i)⟩_{q_{x_i}(x_i)}} .   (2.113)

Therefore the variational posterior q_θ(θ) in (2.112) is of conjugate form, according to condition 2 (2.88).

Proof of (c): by induction.

Assume conditions 1 (2.80) and 2 (2.88) are met (i.e. the model is in the CE family). From part (a), the VBE step produces a posterior distribution q_x(x) in the exponential family, preserving condition 1 (2.80); the parameter distribution q_θ(θ) remains unaltered, preserving condition 2 (2.88). From part (b), the VBM step produces a parameter posterior q_θ(θ) that is of conjugate form, preserving condition 2 (2.88); q_x(x) remains unaltered from the VBE step, preserving condition 1 (2.80). Thus under both the VBE and VBM steps, conjugate-exponentiality is preserved, which makes the theorem applicable at every iteration of VBEM.

As before, since q_θ(θ) and q_{x_i}(x_i) are coupled, (2.97) and (2.94) do not provide an analytic solution to the minimisation problem, so the optimisation problem is solved numerically by iterating between the fixed point equations given by these equations. To summarise briefly:

VBE Step: Compute the expected sufficient statistics {ū(y_i)}_{i=1}^n under the hidden variable distributions q_{x_i}(x_i), for all i.

VBM Step: Compute the expected natural parameters φ̄ = ⟨φ(θ)⟩ under the parameter distribution given by η̃ and ν̃.

2.4.3 Implications

In order to really understand what the conjugate-exponential formalism buys us, let us reiterate the main points of theorem 2.2 above. The first result is that in the VBM step the analytical form of the variational posterior q_θ(θ) does not change during iterations of VBEM: e.g. if the posterior is Gaussian at iteration t = 1, then only a Gaussian need be represented at future iterations. If it were able to change, which is the case in general (theorem 2.1), the posterior could quickly become unmanageable, and (further) approximations would be required to prevent the algorithm becoming too complicated.

Table 2.1: Comparison of EM for ML/MAP estimation against variational Bayesian EM for CE models.

    EM for MAP estimation                                   Variational Bayesian EM
    Goal: maximise p(θ | y, m) w.r.t. θ                     Goal: lower bound p(y | m)
    E Step: compute                                         VBE Step: compute
      q_x^(t+1)(x) = p(x | y, θ^(t))                          q_x^(t+1)(x) = p(x | y, φ̄^(t))
    M Step:                                                 VBM Step:
      θ^(t+1) = arg max_θ ∫ dx q_x^(t+1)(x) ln p(x, y, θ)     q_θ^(t+1)(θ) ∝ exp ∫ dx q_x^(t+1)(x) ln p(x, y, θ)

The second result is that the posterior over hidden variables calculated in the VBE step is exactly the posterior that would be calculated had we been performing an ML/MAP E step. That is, the inferences using an ensemble of models q_θ(θ) can be represented by the effect of a point parameter, θ̃. The task of performing many inferences, each of which corresponds to a different parameter setting, can be replaced with a single inference step: it is possible to infer the hidden states in a conjugate-exponential model tractably while integrating over an ensemble of model parameters.

Comparison to EM for ML/MAP parameter estimation

We can draw a tight parallel between the EM algorithm for ML/MAP estimation, and our VBEM algorithm applied specifically to conjugate-exponential models. These are summarised in table 2.1. This general result of VBEM for CE models was reported in Ghahramani and Beal (2001), and generalises the well known EM algorithm for ML estimation (Dempster et al., 1977). It is a special case of the variational Bayesian algorithm (theorem 2.1) used in Ghahramani and Beal (2000) and in Attias (2000), yet encompasses many of the models that have been so far subjected to the variational treatment. Its particular usefulness is as a guide for the design of models, to make them amenable to efficient approximate Bayesian inference.

The VBE step has about the same time complexity as the E step, and is in all ways identical except that it is re-written in terms of the expected natural parameters. In particular, we can make use of all relevant propagation algorithms such as junction tree, Kalman smoothing, or belief propagation. The VBM step computes a distribution over parameters (in the conjugate family) rather than a point estimate. Both ML/MAP EM and VBEM algorithms monotonically increase an objective function, but the latter also incorporates a model complexity penalty by integrating over the parameters.


More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2) 1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Lecture 3: Probability Distributions

Lecture 3: Probability Distributions Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the

More information

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials

MA 323 Geometric Modelling Course Notes: Day 13 Bezier Curves & Bernstein Polynomials MA 323 Geometrc Modellng Course Notes: Day 13 Bezer Curves & Bernsten Polynomals Davd L. Fnn Over the past few days, we have looked at de Casteljau s algorthm for generatng a polynomal curve, and we have

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

The Feynman path integral

The Feynman path integral The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space

More information

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1 On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

arxiv: v2 [stat.me] 26 Jun 2012

arxiv: v2 [stat.me] 26 Jun 2012 The Two-Way Lkelhood Rato (G Test and Comparson to Two-Way χ Test Jesse Hoey June 7, 01 arxv:106.4881v [stat.me] 6 Jun 01 1 One-Way Lkelhood Rato or χ test Suppose we have a set of data x and two hypotheses

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X The EM Algorthm (Dempster, Lard, Rubn 1977 The mssng data or ncomplete data settng: An Observed Data Lkelhood (ODL that s a mxture or ntegral of Complete Data Lkelhoods (CDL. (1a ODL(;Y = [Y;] = [Y,][

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

Canonical transformations

Canonical transformations Canoncal transformatons November 23, 2014 Recall that we have defned a symplectc transformaton to be any lnear transformaton M A B leavng the symplectc form nvarant, Ω AB M A CM B DΩ CD Coordnate transformatons,

More information

Introduction to Hidden Markov Models

Introduction to Hidden Markov Models Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Chapter 20 Duration Analysis

Chapter 20 Duration Analysis Chapter 20 Duraton Analyss Duraton: tme elapsed untl a certan event occurs (weeks unemployed, months spent on welfare). Survval analyss: duraton of nterest s survval tme of a subject, begn n an ntal state

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Differentiating Gaussian Processes

Differentiating Gaussian Processes Dfferentatng Gaussan Processes Andrew McHutchon Aprl 17, 013 1 Frst Order Dervatve of the Posteror Mean The posteror mean of a GP s gven by, f = x, X KX, X 1 y x, X α 1 Only the x, X term depends on the

More information

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1

C/CS/Phy191 Problem Set 3 Solutions Out: Oct 1, 2008., where ( 00. ), so the overall state of the system is ) ( ( ( ( 00 ± 11 ), Φ ± = 1 C/CS/Phy9 Problem Set 3 Solutons Out: Oct, 8 Suppose you have two qubts n some arbtrary entangled state ψ You apply the teleportaton protocol to each of the qubts separately What s the resultng state obtaned

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

Computing MLE Bias Empirically

Computing MLE Bias Empirically Computng MLE Bas Emprcally Kar Wa Lm Australan atonal Unversty January 3, 27 Abstract Ths note studes the bas arses from the MLE estmate of the rate parameter and the mean parameter of an exponental dstrbuton.

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Analyss of Varance and Desgn of Exerments-I MODULE III LECTURE - 2 EXPERIMENTAL DESIGN MODELS Dr. Shalabh Deartment of Mathematcs and Statstcs Indan Insttute of Technology Kanur 2 We consder the models

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Integrals and Invariants of Euler-Lagrange Equations

Integrals and Invariants of Euler-Lagrange Equations Lecture 16 Integrals and Invarants of Euler-Lagrange Equatons ME 256 at the Indan Insttute of Scence, Bengaluru Varatonal Methods and Structural Optmzaton G. K. Ananthasuresh Professor, Mechancal Engneerng,

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Goodness of fit and Wilks theorem

Goodness of fit and Wilks theorem DRAFT 0.0 Glen Cowan 3 June, 2013 Goodness of ft and Wlks theorem Suppose we model data y wth a lkelhood L(µ) that depends on a set of N parameters µ = (µ 1,...,µ N ). Defne the statstc t µ ln L(µ) L(ˆµ),

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems

Chapter 12. Ordinary Differential Equation Boundary Value (BV) Problems Chapter. Ordnar Dfferental Equaton Boundar Value (BV) Problems In ths chapter we wll learn how to solve ODE boundar value problem. BV ODE s usuall gven wth x beng the ndependent space varable. p( x) q(

More information

The Basic Idea of EM

The Basic Idea of EM The Basc Idea of EM Janxn Wu LAMDA Group Natonal Key Lab for Novel Software Technology Nanjng Unversty, Chna wujx2001@gmal.com June 7, 2017 Contents 1 Introducton 1 2 GMM: A workng example 2 2.1 Gaussan

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

CHAPTER 14 GENERAL PERTURBATION THEORY

CHAPTER 14 GENERAL PERTURBATION THEORY CHAPTER 4 GENERAL PERTURBATION THEORY 4 Introducton A partcle n orbt around a pont mass or a sphercally symmetrc mass dstrbuton s movng n a gravtatonal potental of the form GM / r In ths potental t moves

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Statistical learning

Statistical learning Statstcal learnng Model the data generaton process Learn the model parameters Crteron to optmze: Lkelhood of the dataset (maxmzaton) Maxmum Lkelhood (ML) Estmaton: Dataset X Statstcal model p(x;θ) (θ parameters)

More information