Rice University STAT 631 / ELEC 633: Graphical Models
EXPECTATION PROPAGATION
September 30, 2008
Instructor: Dr. Volkan Cevher
Scribes: Ahmad Beirami, Andrew Waters, Matthew Nokleby
Index terms: Approximate inference, expectation propagation, exponential family.

1 Motivation and Introduction

In class, we have encountered inference problems that are too complex to solve exactly. As a result, we have looked at methods, such as Laplace's method and variational Bayes, that simplify the problem. In such methods, we simplify the form of the distributions (by making them a product of simpler distributions, for example) while minimizing some error metric (such as the KL-divergence).

Here, we focus on the expectation propagation (EP) algorithm [1, 2]. EP is a deterministic algorithm which approximates the true distributions with exponential-family distributions. Specifically, EP applies to probabilistic models where the joint distribution has the factorization

    p(x, \theta) = \prod_i f_i(\theta),                                            (1)

where the factors f_i(\theta) are

    f_0(\theta) = p(\theta),                                                       (2)
    f_i(\theta) = p(x_i \mid \theta),  i = 1, ..., n,                              (3)

where x is the observed data and \theta is the latent variable whose distribution we hope to infer. So, the data x_i are conditionally independent given \theta. (Note that the factors f_i(\theta) do indeed depend on x.) We can compute the posterior according to Bayes' rule:

    p(\theta \mid x) = \frac{1}{p(x)} \prod_i f_i(\theta),                         (4)

where the evidence p(x) is

    p(x) = \int \prod_i f_i(\theta) \, d\theta.                                    (5)

Of course, if the factors f_i(\theta) are complicated, computing the posterior may be difficult. So, we approximate p(\theta \mid x) with a mathematically tractable distribution q(\theta). In EP, we choose q(\theta) to be of the form

    q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta),                           (6)

where each \tilde{f}_i(\theta) is a member of the exponential family. We will review the exponential family in the next section, but for now it is enough to say that the exponential family is considerably easier to work with, which simplifies the inference problem.
Each factor \tilde{f}_i(\theta) in the approximation corresponds to one factor f_i(\theta). We want to find the best exponential-family approximation of p according to a meaningful error metric. For EP, our error metric is KL(p || q). Notice that we do not use KL(q || p) as we did previously in variational Bayes. We previously discussed the differences between the two; for example, KL(q || p) gives poor results when the original distribution has multiple modes. However, even assuming that q(\theta) belongs to the exponential family, minimizing KL(p || q) directly is not feasible. Instead, EP performs iterations which, in practice, satisfactorily minimize KL(p || q). But there are no guarantees of optimality, or even convergence, of the algorithm.

The remainder of the document is organized as follows. In Section 2 we briefly review the exponential family. In Section 3 we detail the steps of the expectation propagation algorithm. We conclude with two examples, borrowed from [1], in Section 4.

2 Exponential Family

In this section, we briefly review the exponential family. A distribution f(\theta) from the exponential family has the form

    f(\theta) = h(\theta) g(\eta) \exp\{\eta^T u(\theta)\},                        (7)

where \eta is a vector of hyperparameters, and the functions h(\theta), g(\eta), and u(\theta) are known. Although it has a simple form, the exponential family is a broad family, which includes, for example, the Gaussian distribution. A bivariate Gaussian distribution looks like

    p(x \mid \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\{-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\},   (8)

for x = (x_1, x_2) and \Sigma^{-1} = \Lambda. We can rewrite this in the form of (7):

    p(x) = h(x) g(\eta) \exp\{\eta^T u(x)\},                                       (9)

where we choose

    u(x) = (x_1, \; x_2, \; x_1 x_2, \; x_1^2, \; x_2^2)^T,
    \eta = (\Lambda_{11}\mu_1 + \Lambda_{12}\mu_2, \; \Lambda_{22}\mu_2 + \Lambda_{12}\mu_1, \; -\Lambda_{12}, \; -\tfrac{1}{2}\Lambda_{11}, \; -\tfrac{1}{2}\Lambda_{22})^T.   (10)

We also choose h(x) = 1 and

    g(\eta) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\{-\tfrac{1}{2}\Lambda_{11}\mu_1^2 - \Lambda_{12}\mu_1\mu_2 - \tfrac{1}{2}\Lambda_{22}\mu_2^2\},   (11)

which can be rewritten purely in terms of \eta as

    g(\eta) = \frac{1}{2\pi} \sqrt{4\eta_4\eta_5 - \eta_3^2} \, \exp\{-\tfrac{1}{2}(\eta_1 \mu_1 + \eta_2 \mu_2)\},   (12)

where

    \mu_1 = \frac{\eta_2\eta_3 - 2\eta_1\eta_5}{4\eta_4\eta_5 - \eta_3^2},          (13)

    \mu_2 = \frac{\eta_1\eta_3 - 2\eta_2\eta_4}{4\eta_4\eta_5 - \eta_3^2}.          (14)
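To make the parameterization (10) and the inverse relations (13)-(14) concrete, here is a small numerical check (an illustration we add; the helper functions and the test values are ours, not from [1]):

    import numpy as np

    def natural_params(mu, Lam):
        """Natural parameters eta of (10) for a bivariate Gaussian with mean mu
        and precision Lam = Sigma^{-1}; u(x) = (x1, x2, x1*x2, x1^2, x2^2)."""
        return np.array([
            Lam[0, 0] * mu[0] + Lam[0, 1] * mu[1],  # eta_1
            Lam[1, 1] * mu[1] + Lam[0, 1] * mu[0],  # eta_2
            -Lam[0, 1],                             # eta_3
            -Lam[0, 0] / 2.0,                       # eta_4
            -Lam[1, 1] / 2.0,                       # eta_5
        ])

    def mean_from_natural(eta):
        """Recover (mu_1, mu_2) from eta via (13)-(14)."""
        det = 4 * eta[3] * eta[4] - eta[2] ** 2     # equals det(Lambda)
        return np.array([(eta[1] * eta[2] - 2 * eta[0] * eta[4]) / det,
                         (eta[0] * eta[2] - 2 * eta[1] * eta[3]) / det])

    mu = np.array([1.0, -2.0])
    Lam = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(mean_from_natural(natural_params(mu, Lam)))  # recovers [ 1. -2.]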
3 Implementation of Expectation Propagation

In this section, we discuss the implementation of expectation propagation. Our factorization assumes that

    p(\theta \mid x) = \frac{1}{p(x)} \prod_i f_i(\theta),                         (15)

and we want to approximate p(\theta \mid x) using q(\theta) as given by

    q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta).                           (16)

First, we calculate KL(p || q):

    KL(p || q) = -\int p \log(q/p) \, d\theta = -\log g(\eta) - \eta^T E_{p(\theta)}[u(\theta)] + \text{constant}.   (17)

To minimize KL(p || q), we form the gradient and set it equal to zero:

    \frac{\partial KL}{\partial \eta} = -\nabla_\eta \log g(\eta) - E_{p(\theta)}[u(\theta)] = 0.   (18)

Note that since q is determined by \eta, optimization with respect to q is equivalent to optimization with respect to \eta. Therefore, we have

    -\nabla_\eta \log g(\eta) = E_{p(\theta)}[u(\theta)].                          (19)

Since q follows an exponential-family density as given by (6), it integrates to one:

    \int q(\theta) \, d\theta = g(\eta) \int h(\theta) \exp\{\eta^T u(\theta)\} \, d\theta = 1.   (20)

Now, we can differentiate this equation with respect to \eta to obtain

    \nabla_\eta g(\eta) \int h(\theta) \exp\{\eta^T u(\theta)\} \, d\theta + g(\eta) \int h(\theta) \exp\{\eta^T u(\theta)\} \, u(\theta) \, d\theta = 0,   (21)

where we have used Leibniz's rule. (It is straightforward to show that if x = (x_1, ..., x_n) and a = (a_1, ..., a_n), then for f(x) = \exp\{x^T a\}, taking derivatives gives

    \frac{\partial f(x)}{\partial x} = a \exp\{x^T a\} = a f(x).)                  (22)

It is easy to see that the latter term in (21) is E_{q(\theta)}[u(\theta)]. Therefore, our optimality conditions reduce to

    -\frac{\nabla_\eta g(\eta)}{g(\eta)} = -\nabla_\eta \log g(\eta) = E_{q(\theta)}[u(\theta)].   (23)

Combining (19) and (23), the whole problem reduces to

    E_{q(\theta)}[u(\theta)] = E_{p(\theta)}[u(\theta)].                           (24)

In other words, at each step of the optimization, we need to match the moments between p and q.
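Identity (23) says that minus the gradient of the log-normalizer equals the expected sufficient statistics under q. As a quick numerical sanity check (our own, not from the notes), consider a univariate Gaussian written as q(\theta) = g(\eta) \exp(\eta_1 \theta + \eta_2 \theta^2) with u(\theta) = (\theta, \theta^2) and \eta_2 < 0:

    import numpy as np

    def log_g(eta1, eta2):
        # 1/g = \int exp(eta1*t + eta2*t^2) dt = sqrt(pi/-eta2) exp(-eta1^2/(4*eta2))
        return -0.5 * np.log(np.pi / -eta2) + eta1 ** 2 / (4 * eta2)

    eta1, eta2, h = 1.0, -0.5, 1e-6
    grad = np.array([
        (log_g(eta1 + h, eta2) - log_g(eta1 - h, eta2)) / (2 * h),
        (log_g(eta1, eta2 + h) - log_g(eta1, eta2 - h)) / (2 * h),
    ])
    mean, var = -eta1 / (2 * eta2), -1 / (2 * eta2)   # moments of this Gaussian
    print(-grad, [mean, var + mean ** 2])             # both approximately [1.0, 2.0]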
However, solving this equation is itself intractable, since it requires calculating expectations with respect to the original probability distribution p. But we have already decided that p is intractable, which is why we are using an approximation in the first place.

A plausible simplification would be to approximate each factor in p using one factor in q. That is, we could optimize by matching the moments between f_i(\theta) and \tilde{f}_i(\theta). However, by doing so we restrict ourselves to a limited subset of the feasible region. In other words, we eliminate many candidate solutions that effectively minimize KL(p || q), but for which the individual moments of \tilde{f}_i do not match those of f_i.

[Figure 1: Approximation of the individual probability distributions f_1(x) and f_2(x) by \tilde{f}_1(x) and \tilde{f}_2(x). Plots not reproduced.]

[Figure 2: The product of the approximate individual distributions, \tilde{f}_1(x)\tilde{f}_2(x), fails to be a good approximation of the product f_1(x)f_2(x), while a good approximation could be obtained by approximating the product itself. Plots not reproduced.]

For example, consider Figure 1, where f_1(x) has two modes while f_2(x) has only one. By minimizing the KL-divergence for each distribution individually, \tilde{f}_1(x) and \tilde{f}_2(x) are obtained, as shown in Figure 1. Now, consider f_1(x)f_2(x). Since f_2(x) \approx 0 for x in the second mode of f_1(x), the product f_1(x)f_2(x) has only one mode, as shown in Figure 2, and is well approximated by minimizing the KL-divergence for the product directly. However, \tilde{f}_1(x)\tilde{f}_2(x) yields a distribution which poorly approximates f_1(x)f_2(x), as demonstrated in Figure 2.
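To see what moment matching does to a bimodal factor like f_1(x) in Figure 1, here is a small sketch (our own toy example; the mixture weights and parameters are assumed for illustration). Minimizing KL(p || q) over single Gaussians q reduces, by (24), to matching the mixture's overall mean and variance, which places q broadly between the two modes:

    import numpy as np

    # Assumed toy bimodal mixture: p(theta) = 0.5 N(-2, 0.5) + 0.5 N(2, 0.5)
    weights = np.array([0.5, 0.5])
    means = np.array([-2.0, 2.0])
    variances = np.array([0.5, 0.5])

    m = np.sum(weights * means)                           # E_p[theta] = 0
    second = np.sum(weights * (variances + means ** 2))   # E_p[theta^2]
    v = second - m ** 2                                   # Var_p[theta] = 4.5
    print(m, v)  # moment-matched q = N(0, 4.5) straddles both modes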
So, we need something more sophisticated than matching moments factor-by-factor. At each step of EP, we pick some j, include all the factors except for f_j, and try to approximate the result. That is, instead of matching moments for each factor of p and q, we match moments for the distributions with f_j and \tilde{f}_j removed. This way, if we have a large number of factors, we obtain a much better approximation than with factor-by-factor matching, since the form of the approximation is not as restricted. To see this, let N be the total number of factors in p. Approximating factor-by-factor introduces N extra constraints into the original problem. In contrast, omitting one factor at a time adds only a single extra constraint, which yields a much better result.

This suggests an iterative method for solving the approximate problem. First, we separate the factor \tilde{f}_j from the approximate distribution:

    q(\theta) = \frac{1}{Z} \tilde{f}_j(\theta) \prod_{i \neq j} \tilde{f}_i(\theta).   (25)

We define q^{\j} as the distribution with factor j omitted:

    q^{\j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}.                        (26)

Now, we define q^* as the distribution in which all the factors are from the approximating family except for factor j, which is taken from the original distribution:

    q^*(\theta) = \frac{1}{Z_j} f_j(\theta) \frac{q(\theta)}{\tilde{f}_j(\theta)} = \frac{1}{Z_j} f_j(\theta) q^{\j}(\theta).   (27)

Then, we solve this simpler problem by minimizing the KL-divergence between q^* and q:

    q^{new} = \arg\min_q KL(q^*(\theta) \, || \, q(\theta)),                       (28)

which is easily solved by moment matching. Then, we can find the updated \tilde{f}_j from

    \tilde{f}_j(\theta) = K \frac{q^{new}(\theta)}{q^{\j}(\theta)}.                (29)

We sketch one sweep of this procedure at the end of the section.

Unfortunately, there are no theoretical guarantees for matching-pursuit-type iterations in general. So, in general, we cannot say anything about the quality of the approximation produced by EP, and there exist examples where EP fails to satisfactorily minimize the KL-divergence. However, for large N, we have seen that EP at least outperforms moment matching on individual factors. And, despite the lack of convergence guarantees, EP works well in practice.
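The following schematic sketch (our own, for scalar \theta with Gaussian sites in natural parameters) shows the shape of one EP sweep implementing (26)-(29). The moment-matching projection of (28) is problem-specific, so it is passed in as a function `project`, which is assumed to return the natural parameters of the Gaussian matching the moments of q^*(\theta) \propto f_j(\theta) q^{\j}(\theta):

    import numpy as np

    def ep_sweep(site_r, site_p, factors, project):
        """One EP sweep over 1-D Gaussian sites. q(theta) is tracked via its
        natural parameters (r, p), i.e. q(theta) ∝ exp(r*theta - p*theta**2/2)."""
        r, p = site_r.sum(), site_p.sum()                  # global q, as in (6)
        for j, f_j in enumerate(factors):
            r_cav, p_cav = r - site_r[j], p - site_p[j]    # cavity q^{\j}, eq. (26)
            r_new, p_new = project(r_cav, p_cav, f_j)      # moment matching, eq. (28)
            site_r[j] = r_new - r_cav                      # updated site, eq. (29)
            site_p[j] = p_new - p_cav
            r, p = r_new, p_new
        return site_r, site_p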
4 Additional Examples

4.1 Clutter Problem

In the clutter problem discussed in [1], we have Gaussian observations of d-dimensional data x which are corrupted by noise and embedded in unrelated clutter. This gives us a Gaussian mixture model:

    p(y \mid x) = (1 - w) N(y; x, I) + w N(y; 0, 10I),                             (30)

where I is the identity matrix and N(y; m, V) denotes the multivariate Gaussian distribution over y with mean m and covariance V. The first term in (30) is the (noise-corrupted) desired data, and the second term is diffuse Gaussian clutter. We assume that the data has a Gaussian prior:

    p(x) = N(x; 0, 100I).                                                          (31)

Presumably, we pick a large variance to make the prior as non-informative as possible. If we have n independent observations D = {y_1, ..., y_n}, the joint distribution is given by:

    p(D, x) = p(x) \prod_{i=1}^{n} p(y_i \mid x) = \prod_{i=0}^{n} f_i(x).         (32)

So, the factor f_0 is the Gaussian prior, and each additional f_i is a mixture-model likelihood function. Using EP, we find a Gaussian approximation to p(D, x). Specifically, we will choose the spherical Gaussian distribution N(m_x, v_x I), whose components are uncorrelated and share the same variance. So, we need to find the parameters m_x and v_x that give the best approximation of p(D, x).

To find this approximation with EP, we first initialize the approximate terms \tilde{f}_i:

    \tilde{f}_i(x) = s_i \exp\left(-\frac{1}{2 v_i}(x - m_i)^T (x - m_i)\right),   (33)

where s_i, v_i, and m_i are the parameters of the distribution. For \tilde{f}_0, we just initialize to the parameters of the prior, which is already Gaussian: v_0 = 100, s_0 = (2\pi v_0)^{-d/2}, and m_0 = 0. We initialize the data terms such that \tilde{f}_i = 1: v_i = \infty, m_i = 0, and s_i = 1. This gives us the global parameters m_x = 0 and v_x = 100.

After initialization, we iterate until all (m_i, v_i, s_i) converge to within some small \epsilon > 0 (in [1], \epsilon = 10^{-4}). At each iteration, we perform the following steps for each i (the notation \i refers to the set with element i removed); a runnable sketch of these updates follows the list:

1. Remove the factor \tilde{f}_i from the current posterior estimate, giving:

    (v_x^{\i})^{-1} = v_x^{-1} - v_i^{-1},                                         (34)
    m_x^{\i} = m_x + v_x^{\i} v_i^{-1} (m_x - m_i).                                (35)

2. Recompute (m_x, v_x) from (m_x^{\i}, v_x^{\i}) via moment matching, and compute the normalization constant

    Z_i = (1 - w) N(y_i; m_x^{\i}, (v_x^{\i} + 1) I) + w N(y_i; 0, 10 I).          (36)

That is, we compute Z_i by evaluating the estimated likelihood factor at the observation y_i.

3. Update \tilde{f}_i:

    v_i^{-1} = v_x^{-1} - (v_x^{\i})^{-1},                                         (37)
    m_i = m_x^{\i} + (v_i + v_x^{\i})(v_x^{\i})^{-1} (m_x - m_x^{\i}),             (38)
    s_i = \frac{Z_i}{(2\pi v_i)^{d/2} \, N(m_i; m_x^{\i}, (v_i + v_x^{\i}) I)}.    (39)
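Here is a runnable sketch of these updates for d = 1 (our own illustrative implementation; the explicit moment-matching expressions inside step 2 follow [1] and are not spelled out in the notes, and the toy data below are assumed):

    import numpy as np

    def gauss(y, m, v):
        """1-D Gaussian density N(y; m, v)."""
        return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

    def ep_clutter(y, w=0.5, n_sweeps=20):
        """EP for the clutter problem with d = 1 and prior N(0, 100)."""
        n = len(y)
        v = np.full(n, np.inf)       # site variances (f~_i = 1 initially)
        m = np.zeros(n)              # site means
        s = np.ones(n)               # site scales
        mx, vx = 0.0, 100.0          # posterior q(x) = N(mx, vx), starts at prior
        for _ in range(n_sweeps):
            for i in range(n):
                # 1. remove site i, eqs. (34)-(35); 1/inf = 0 handles fresh sites
                inv_vi = 0.0 if np.isinf(v[i]) else 1.0 / v[i]
                v_cav = 1.0 / (1.0 / vx - inv_vi)
                m_cav = mx + v_cav * inv_vi * (mx - m[i])
                # 2. moment matching against the mixture factor, eq. (36);
                #    the (mx, vx) expressions follow the derivation in [1]
                Z = (1 - w) * gauss(y[i], m_cav, v_cav + 1) + w * gauss(y[i], 0.0, 10.0)
                rho = (1 - w) * gauss(y[i], m_cav, v_cav + 1) / Z
                mx = m_cav + rho * v_cav * (y[i] - m_cav) / (v_cav + 1)
                vx = (v_cav - rho * v_cav ** 2 / (v_cav + 1)
                      + rho * (1 - rho) * (v_cav * (y[i] - m_cav) / (v_cav + 1)) ** 2)
                # 3. update site i, eqs. (37)-(39); v[i] can come out negative for
                #    some data, as noted in [1] -- this sketch assumes it does not
                v[i] = 1.0 / (1.0 / vx - 1.0 / v_cav)
                m[i] = m_cav + (v[i] + v_cav) / v_cav * (mx - m_cav)
                s[i] = Z / (np.sqrt(2 * np.pi * v[i]) * gauss(m[i], m_cav, v[i] + v_cav))
        return mx, vx, (m, v, s)

    # Example: a few observations near x = 2 plus one clutter point
    print(ep_clutter(np.array([1.8, 2.2, 2.1, -6.0]))[:2])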
Finally, when the iterations terminate, we can use the approximate factors to compute the normalizing constant needed to perform inference:

    p(D) \approx (2\pi v_x)^{d/2} \exp(B/2) \prod_{i=0}^{n} s_i,                   (40)

where

    B = \frac{m_x^T m_x}{v_x} - \sum_{i=0}^{n} \frac{m_i^T m_i}{v_i}.              (41)

4.2 Bayes Point Machine

Minka in [1] discusses the use of EP for the problem of Bayesian point classification. In this problem, we seek a weight vector w that classifies points x into one of two groups y = \pm 1 through the following rule:

    y = \mathrm{sign}(w^T x).                                                      (42)

Given a training set D = {(x_1, y_1), ..., (x_n, y_n)}, we can write the likelihood for w as:

    p(D \mid w) = \prod_i p(y_i \mid x_i, w) = \prod_i \phi\left(\frac{y_i \, w^T x_i}{\epsilon}\right),   (43)

where

    \phi(z) = \int_{-\infty}^{z} N(t; 0, 1) \, dt,                                 (44)

and \epsilon is an error-tolerance parameter. It can be seen that each likelihood factor becomes a step function as \epsilon \to 0.

Minka derives the EP algorithm for this scenario assuming a multivariate Gaussian posterior on w, a Gaussian prior, and exponential form for the site factors \tilde{f}_i(w). He shows that applying EP to the BPM problem yields superior results relative to other approximation methods previously available in the literature. The results are shown in Figure 3, which plots computational requirements (in FLOPS) against classification error probability. It can be seen that EP not only beats the other algorithms in terms of classification error, but is also computationally the most efficient.
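For concreteness, the probit likelihood (43)-(44) can be evaluated with a standard normal CDF (a sketch we add; the function name and the choice eps = 0.1 are ours):

    import numpy as np
    from scipy.stats import norm

    def bpm_log_likelihood(w, X, y, eps=0.1):
        """log p(D|w) per (43)-(44): product of probit factors phi(y_i w^T x_i / eps).
        X is (n, d), y is (n,) with entries +/-1."""
        z = y * (X @ w) / eps
        return np.sum(norm.logcdf(z))   # log of the product in (43)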
The complete EP algorithm for the BPM from [1] is summarized below:

1) \tilde{f}_i(w) = s_i \exp\left(-\frac{1}{2 v_i}(w^T x_i - m_i)^2\right). Initialize with v_i = \infty, m_i = 0, s_i = 1.

2) q(w) = N(m_w, V_w). Initialize with the prior: m_w = 0, V_w = I.

3) Loop over i = 1, 2, ..., n until convergence:

   a) Remove \tilde{f}_i(w) from the posterior:

          V_w^{\i} = V_w + \frac{(V_w x_i)(V_w x_i)^T}{v_i - x_i^T V_w x_i},
          m_w^{\i} = m_w + (V_w^{\i} x_i) \, v_i^{-1} (x_i^T m_w - m_i).

   b) Recompute (m_w, V_w) via the following:

          z_i = \frac{(m_w^{\i})^T x_i}{\sqrt{x_i^T V_w^{\i} x_i + 1}},
          \alpha_i = \frac{1}{\sqrt{x_i^T V_w^{\i} x_i + 1}} \cdot \frac{N(z_i; 0, 1)}{\phi(z_i)},
          m_w = m_w^{\i} + V_w^{\i} \alpha_i x_i,
          V_w = V_w^{\i} - (V_w^{\i} x_i) \frac{\alpha_i (x_i^T m_w + \alpha_i)}{x_i^T V_w^{\i} x_i + 1} (V_w^{\i} x_i)^T.

   c) Update \tilde{f}_i(w):

          v_i = \frac{x_i^T V_w^{\i} x_i + 1}{\alpha_i (x_i^T m_w + \alpha_i)} - x_i^T V_w^{\i} x_i,
          m_i = x_i^T m_w^{\i} + (v_i + x_i^T V_w^{\i} x_i) \alpha_i,
          s_i = \phi(z_i) \sqrt{1 + v_i^{-1} x_i^T V_w^{\i} x_i} \, \exp\left(\frac{x_i^T V_w^{\i} x_i + 1}{2} \cdot \frac{\alpha_i}{x_i^T m_w + \alpha_i}\right).

4) Compute:

          B = m_w^T V_w^{-1} m_w - \sum_i \frac{m_i^2}{v_i},
          p(D) \approx |V_w|^{1/2} \exp(B/2) \prod_{i=1}^{n} s_i.

[Figure 3: Comparison of BPM using EP with other classical methods; computation (FLOPS) versus classification error. Plot not reproduced.]
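As a small illustration of the listing above (a sketch we add, not from the notes), step (a) is a rank-one "cavity" update that removes site i from the Gaussian posterior on w; with a fresh site (v_i = \infty), it leaves the posterior unchanged:

    import numpy as np

    def remove_site(V_w, m_w, x_i, v_i, m_i):
        """Step (a): form the cavity (V_w^{\\i}, m_w^{\\i}) by removing site i.
        For v_i = inf the two correction terms vanish, as expected."""
        Vx = V_w @ x_i
        V_cav = V_w + np.outer(Vx, Vx) / (v_i - x_i @ Vx)
        m_cav = m_w + (V_cav @ x_i) * (x_i @ m_w - m_i) / v_i
        return V_cav, m_cav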
References

[1] T. Minka, "Expectation propagation for approximate Bayesian inference," in Proc. 17th Conference on Uncertainty in Artificial Intelligence (UAI), 2001.

[2] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.