3.1 ML and Empirical Distribution



67577 Intro. to Machine Learning, Fall semester, 2008/9
Lecture 3: Maximum Likelihood / Maximum Entropy Duality
Lecturer: Amnon Shashua
Scribe: Amnon Shashua

In the previous lecture we defined the principle of Maximum Likelihood (ML): suppose we have random variables $X_1, \dots, X_n$ that form a random sample from a discrete distribution whose joint probability distribution is $P(x \mid \phi)$, where $x = (x_1, \dots, x_n)$ is a vector in the sample and $\phi$ is a parameter from some parameter space (which could be a discrete set of values, say class membership). When $P(x \mid \phi)$ is considered as a function of $\phi$ it is called the likelihood function. The ML principle is to select the value of $\phi$ that maximizes the likelihood function over the observations (training set) $x_1, \dots, x_m$. If the observations are sampled i.i.d. (a common, not always valid, assumption), then the ML principle is to maximize:

$$\phi^* = \operatorname{argmax}_{\phi} \prod_{i=1}^{m} P(x_i \mid \phi) = \operatorname{argmax}_{\phi} \log \prod_{i=1}^{m} P(x_i \mid \phi) = \operatorname{argmax}_{\phi} \sum_{i=1}^{m} \log P(x_i \mid \phi),$$

where, due to the product nature of the problem, it is more convenient to maximize the log-likelihood. We will take a closer look today at the ML principle by introducing a key element known as the relative entropy measure between distributions.

3.1 ML and Empirical Distribution

The ML principle states that the empirical distribution of an i.i.d. sequence of examples is the closest possible (in terms of relative entropy, which is defined later) to the true distribution. To make this statement clear, let $X$ be a set of symbols $\{a_1, \dots, a_n\}$ and let $P(a \mid \phi)$ be the probability (belonging to a parametric family with parameter $\phi$) of drawing a symbol $a \in X$. Let $x_1, \dots, x_m$ be a sequence of symbols drawn i.i.d. according to $P$. The occurrence frequency $f(a)$ measures the number of draws of the symbol $a$:

$$f(a) = |\{i : x_i = a\}|,$$

and let the empirical distribution be defined by

$$\hat{P}(a) = \frac{1}{\sum_{\alpha \in X} f(\alpha)} f(a) = \frac{1}{\|f\|_1} f(a) = (1/m) f(a).$$

The joint probability $P(x_1, \dots, x_m \mid \phi)$ is equal to the product $\prod_i P(x_i \mid \phi)$, which according to the definitions above is equal to:

$$P(x_1, \dots, x_m \mid \phi) = \prod_{i=1}^{m} P(x_i \mid \phi) = \prod_{a \in X} P(a \mid \phi)^{f(a)}.$$
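The definitions of the occurrence frequency and the empirical distribution can be sketched in a few lines of Python. This is only an illustration: the three-symbol alphabet, the sample sequence, and the `model` probabilities are made-up values, and the function names are hypothetical. The sketch also checks that summing log-probabilities over the samples agrees with the count-weighted form $\sum_a f(a) \ln P(a \mid \phi)$.

```python
from collections import Counter
import math

def occurrence_frequency(samples, alphabet):
    """f(a): number of draws in `samples` equal to symbol a."""
    counts = Counter(samples)
    return {a: counts.get(a, 0) for a in alphabet}

def empirical_distribution(freq):
    """P_hat(a) = f(a) / m, where m is the total number of draws."""
    m = sum(freq.values())
    return {a: c / m for a, c in freq.items()}

def log_likelihood_from_samples(samples, model):
    """sum_i ln P(x_i | phi), one term per draw."""
    return sum(math.log(model[x]) for x in samples)

def log_likelihood_from_counts(freq, model):
    """sum_a f(a) ln P(a | phi), one term per symbol."""
    return sum(c * math.log(model[a]) for a, c in freq.items() if c > 0)

# Illustrative values only (not from the lecture).
alphabet = ["a", "b", "c"]
samples = ["a", "b", "a", "c", "a", "b"]
model = {"a": 0.5, "b": 0.3, "c": 0.2}

freq = occurrence_frequency(samples, alphabet)
p_hat = empirical_distribution(freq)
ll_samples = log_likelihood_from_samples(samples, model)
ll_counts = log_likelihood_from_counts(freq, model)
```

Both log-likelihood values coincide up to floating-point rounding, which is the product identity above written in log form.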

The ML principle is therefore equivalent to the optimization problem:

$$\max_{P \in Q} \prod_{a \in X} P(a \mid \phi)^{f(a)} \qquad (3.1)$$

where $Q = \{q \in R^n : q \ge 0, \sum_i q_i = 1\}$ denotes the set of $n$-dimensional probability vectors (the "probability simplex"). Let $p_i$ stand for $P(a_i \mid \phi)$ and $f_i$ stand for $f(a_i)$. Since $\operatorname{argmax}_x z(x) = \operatorname{argmax}_x \ln z(x)$, and given that $\ln \prod_i p_i^{f_i} = \sum_i f_i \ln p_i$, the solution to this problem can be found by setting the partial derivative of the Lagrangian to zero:

$$L(p, \lambda, \mu) = \sum_{i=1}^{n} f_i \ln p_i - \lambda \Big( \sum_i p_i - 1 \Big) - \sum_i \mu_i p_i,$$

where $\lambda$ is the Lagrange multiplier associated with the equality constraint $p^\top \mathbf{1} - 1 = 0$ and $\mu_i \ge 0$ are the Lagrange multipliers associated with the inequality constraints $p_i \ge 0$. We also have the complementary slackness condition that sets $\mu_i = 0$ if $p_i > 0$. After setting the partial derivative with respect to $p_i$ to zero we get:

$$p_i = \frac{1}{\lambda + \mu_i} f_i.$$

Assume for now that $f_i > 0$ for $i = 1, \dots, n$. Then from complementary slackness we must have $\mu_i = 0$ (because $p_i > 0$). We are left therefore with the result $p_i = (1/\lambda) f_i$. Following the constraint $p^\top \mathbf{1} = 1$ we obtain $\lambda = \sum_i f_i$. As a result we obtain $P(a \mid \phi) = \hat{P}(a)$. In case $f_i = 0$ we could use the convention $0 \ln 0 = 0$ and from continuity arrive at $p_i = 0$. We have arrived at the following theorem:

Theorem 1 The empirical distribution estimate $\hat{P}$ is the unique Maximum Likelihood estimate of the probability model $Q$ on the occurrence frequency $f(\cdot)$.

This seems like an obvious result, but it actually runs deep because the result holds for a very particular (and non-intuitive at first glance) distance measure between non-negative vectors. Let $\mathrm{dist}(f, p)$ be some distance measure between the two vectors. The result above states that:

$$\hat{P} = \operatorname{argmin}_{p} \mathrm{dist}(f, p) \quad \text{s.t.} \quad p \ge 0, \ \sum_i p_i = 1, \qquad (3.2)$$

for some (family?) of distance measures $\mathrm{dist}()$. It turns out that there is only one² such distance measure, known as the relative-entropy, which satisfies the ML result stated above.

² Not exactly; the picture is a bit more complex. Csiszár's 1972 measures $\mathrm{dist}(p, f) = \sum_i f_i \, \phi(p_i / f_i)$ will satisfy eqn. 3.2 provided that $\phi'^{-1}$ is an exponential. However, $\mathrm{dist}(f, p)$ (parameter positions are switched) will not do it, whereas the relative entropy will satisfy eqn. 3.2 regardless of the order of the parameters $p, f$.

3.2 Relative Entropy

The relative-entropy (RE) measure $D(x \| y)$ between two non-negative vectors $x, y \in R^n$ is defined as:

$$D(x \| y) = \sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i.$$
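As a rough numerical check of eqn. 3.2 with the relative-entropy definition, the sketch below evaluates $D(f \| p)$ at the scaled vector $\hat{P} = f / \sum_i f_i$ and at randomly drawn points of the simplex. The frequency vector `f`, the random seed, and the helper names are illustrative assumptions; the code also assumes the usual conventions $0 \ln(0/y) = 0$ and $x \ln(x/0) = \infty$. A random search is not a proof, it only shows that none of the sampled candidates does better than $\hat{P}$.

```python
import math
import random

def relative_entropy(x, y):
    """D(x||y) = sum_i x_i ln(x_i/y_i) - sum_i x_i + sum_i y_i.

    Assumed conventions: terms with x_i = 0 contribute 0,
    and x_i > 0 with y_i = 0 gives infinity.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        if xi > 0:
            if yi == 0:
                return math.inf
            total += xi * math.log(xi / yi)
    return total - sum(x) + sum(y)

def random_simplex_point(n, rng):
    """A random strictly positive vector normalized to sum to one."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

# Illustrative occurrence frequencies (not from the lecture).
f = [3.0, 2.0, 1.0]
m = sum(f)
p_hat = [fi / m for fi in f]  # empirical distribution: a scaling of f

rng = random.Random(0)
candidates = [random_simplex_point(len(f), rng) for _ in range(1000)]
d_hat = relative_entropy(f, p_hat)
d_min_other = min(relative_entropy(f, p) for p in candidates)
```

Here `d_hat` is no larger than `d_min_other`, in line with the claim that the relative-entropy projection of $f$ onto the simplex is its normalization.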

In the definition of $D(x \| y)$ we use the convention that $0 \ln \frac{0}{0} = 0$ and, based on continuity, that $0 \ln \frac{0}{y} = 0$ and $x \ln \frac{x}{0} = \infty$. When $x, y$ are also probability vectors, i.e., belong to $Q$, then $D(x \| y) = \sum_i x_i \ln \frac{x_i}{y_i}$ is also known as the Kullback-Leibler divergence. The RE measure is not a distance metric, as it is not symmetric, $D(x \| y) \ne D(y \| x)$, and does not satisfy the triangle inequality. Nevertheless, it has several interesting properties which make it a fundamental measure in statistical inference.

The relative entropy is always non-negative and is zero if and only if $x = y$. This comes about from the log-sum inequality:

$$\sum_i x_i \ln \frac{x_i}{y_i} \ge \Big( \sum_i x_i \Big) \ln \frac{\sum_i x_i}{\sum_i y_i}.$$

Thus,

$$D(x \| y) \ge \Big( \sum_i x_i \Big) \ln \frac{\sum_i x_i}{\sum_i y_i} - \sum_i x_i + \sum_i y_i = \bar{x} \ln \frac{\bar{x}}{\bar{y}} - \bar{x} + \bar{y},$$

where $\bar{x} = \sum_i x_i$ and $\bar{y} = \sum_i y_i$. But $a \ln(a/b) \ge a - b$ for $a, b \ge 0$ iff $\ln(a/b) \ge 1 - (b/a)$, which follows from the inequality $\ln(x + 1) > x/(x + 1)$ (which holds for $x > -1$ and $x \ne 0$). We can state the following theorem:

Theorem 2 Let $f \ge 0$ be the occurrence frequency on a training sample. $\hat{P} \in Q$ is a ML estimate iff

$$\hat{P} = \operatorname{argmin}_{p} D(f \| p) \quad \text{s.t.} \quad p \ge 0, \ \sum_i p_i = 1.$$

Proof:

$$D(f \| p) = -\sum_i f_i \ln p_i + \sum_i f_i \ln f_i - \sum_i f_i + 1,$$

and

$$\operatorname{argmin}_{p} D(f \| p) = \operatorname{argmax}_{p} \sum_i f_i \ln p_i = \operatorname{argmax}_{p} \ln \prod_i p_i^{f_i}.$$

There are two (related) interesting points to make here. First, from the proof of Thm. 1 we observe that the non-negativity constraint $p \ge 0$ need not be enforced: as long as $f \ge 0$ (which holds by definition), the closest $p$ to $f$ under the constraint $\sum_i p_i = 1$ must come out non-negative. Second, the fact that the closest point $p$ to $f$ comes out as a scaling of $f$ (which is by definition the empirical distribution $\hat{P}$) arises because of the relative-entropy measure. For example, if we had used a least-squares distance measure $\|f - p\|^2$ the result would not be a scaling of $f$. In other words, we are looking for a projection of the vector $f$ onto the probability simplex, i.e., the intersection of the hyperplane $x^\top \mathbf{1} = 1$ and the non-negative orthant $x \ge 0$. Under relative-entropy the projection is simply a scaling of $f$ (and this is why we do not need to enforce non-negativity). Under least-squares, a projection onto the hyperplane $x^\top \mathbf{1} = 1$ could take us out of the non-negative orthant (see Fig. 3.1 for illustration). So, relative-entropy is special in that regard: it not only provides the ML estimate, but also simplifies the optimization process³ (something which will be more noticeable when we handle a latent class model next lecture).

³ The fact that non-negativity comes for free does not apply to all class (distribution) models. This point will be refined in the next lecture.

Figure 3.1: Projection of a non-negative vector $f$ onto the hyperplane $x^\top \mathbf{1} - 1 = 0$. Under relative-entropy the projection $\hat{P}$ is a scaling of $f$ (and thus lives in the probability simplex). Under least-squares the projection $p_2$ lives outside of the probability simplex, i.e., could have negative coordinates.

3.3 Maximum Entropy and Duality ML/MaxEnt

The relative-entropy measure is not symmetric, thus we expect different outcomes of the optimization $\min_x D(x \| y)$ compared to $\min_y D(x \| y)$. The latter of the two, i.e., $\min_{P \in Q} D(P_0 \| P)$, where $P_0$ is some empirical evidence and $Q$ is some model, provides the ML estimation. For example, in the next lecture we will consider $Q$ the set of low-rank joint distributions (called latent class model) and see how the ML (via relative-entropy minimization) solution can be found.

Let $H(p) = -\sum_i p_i \ln p_i$ denote the entropy function. With regard to $\min_x D(x \| y)$ we can state the following observation:

Claim 1

$$\operatorname{argmin}_{p \in Q} D\big(p \,\|\, \tfrac{1}{n} \mathbf{1}\big) = \operatorname{argmax}_{p \in Q} H(p).$$

Proof:

$$D\big(p \,\|\, \tfrac{1}{n} \mathbf{1}\big) = \sum_i p_i \ln p_i + \Big( \sum_i p_i \Big) \ln(n) = \ln(n) - H(p),$$

which follows from the condition $\sum_i p_i = 1$.

In other words, the closest distribution to uniform is achieved by maximizing the entropy. To make this interesting we need to add constraints. Consider a linear constraint on $p$ such as $\alpha^\top p = \beta$. To be concrete, consider a die with six faces thrown many times, where we wish to estimate the probabilities $p_1, \dots, p_6$ given only the average $\sum_i i \, p_i$. Say the average is 3.5, which is what one would expect from an unbiased die. Laplace's principle of insufficient reasoning calls for assuming uniformity unless there is additional information (a controversial assumption in some cases). In other words, if we have no information except that each $p_i \ge 0$ and that $\sum_i p_i = 1$, we should choose the uniform distribution since we have no reason to choose any other distribution. Thus, employing Laplace's principle we would say that if the average is 3.5 then the most likely distribution is the uniform. What if $\beta = 4.2$? This kind of problem can be stated as an optimization problem:

$$\max_{p} H(p) \quad \text{s.t.} \quad \sum_i p_i = 1, \ \alpha^\top p = \beta.$$

Here $\alpha = (1, 2, 3, 4, 5, 6)^\top$ and $\beta = 4.2$. We now have two constraints, and with the aid of Lagrange multipliers we can arrive at the result:

$$p_i = e^{-(1 - \lambda)} e^{\mu \alpha_i}.$$

Note that because of the exponential $p_i \ge 0$, and again non-negativity comes for free⁴. Following the constraint $\sum_i p_i = 1$ we get $e^{-(1 - \lambda)} = 1 / \sum_i e^{\mu \alpha_i}$, from which we obtain:

$$p_i = \frac{1}{Z} e^{\mu \alpha_i},$$

where $Z$ (a function of $\mu$) is a normalization factor and $\mu$ needs to be set by using $\beta$ (see later).

⁴ Any measure of the class $\mathrm{dist}(p, p_0) = \sum_i p_{0i} \, \phi(p_i / p_{0i})$ minimized under linear constraints will satisfy the result of $p_i \ge 0$ provided that $\phi'^{-1}$ is an exponential.

There is nothing special about the uniform distribution, thus we could be seeking a probability vector $p$ as close as possible to some prior probability $p_0$ under the constraints above:

$$\min_{p} D(p \| p_0) \quad \text{s.t.} \quad \sum_i p_i = 1, \ \alpha^\top p = \beta,$$

with the result:

$$p_i = \frac{1}{Z} p_{0i} \, e^{\mu \alpha_i}.$$

We could also consider adding more linear constraints on $p$ of the form $\sum_i f_{ij} p_i = b_j$, $j = 1, \dots, k$. The result would be:

$$p_i = \frac{1}{Z} p_{0i} \exp\Big( \sum_{j=1}^{k} \mu_j f_{ij} \Big).$$

Probability distributions of this form are called Gibbs Distributions. In practical applications the linear constraints on $p$ could arise from average information about the system, such as the temperature of a fluid (where $p_i$ are the probabilities of the particles moving at various velocities), rainfall data, or general environmental data (where $p_i$ represent the probability of finding animal colonies at discrete locations in a 3D map). A constraint of the form $\sum_i f_{ij} p_i = b_j$ states that the expectation $E_p[f_j]$ should be equal to the empirical expectation $b_j = E_{\hat{P}}[f_j]$, where $\hat{P}$ is either uniform or given as input. Let

$$P = \{p \in R^n : p \ge 0, \ \sum_i p_i = 1, \ E_p[f_j] = E_{\hat{p}}[f_j], \ j = 1, \dots, k\},$$

and

$$Q = \{q \in R^n : q \text{ is a Gibbs distribution}\}.$$

We could therefore consider looking for the ML solution for the parameters $\mu_1, \dots, \mu_k$ of the Gibbs distribution:

$$\min_{q \in Q} D(\hat{p} \| q),$$

where if $\hat{p}$ is uniform then $\min D(\hat{p} \| q)$ can be replaced by $\max \sum_i \ln q_i$ (because $D((1/n)\mathbf{1} \| x) = -\ln(n) - \frac{1}{n} \sum_i \ln x_i$ for a probability vector $x$). As it turns out, MaxEnt and ML are duals of each other, and the intersection of the two sets $P \cap Q$ contains only a single point which solves both problems.

Theorem 3 The following are equivalent:

MaxEnt: $q^* = \operatorname{argmin}_{p \in P} D(p \| p_0)$
ML: $q^* = \operatorname{argmin}_{q \in Q} D(\hat{p} \| q)$
$q^* \in P \cap Q$

In practice, the duality theorem is used to recover the parameters of the Gibbs distribution using the ML route (second line in the theorem above); the algorithm for doing so is known as the iterative scaling algorithm (which we will not get into).
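For the six-faced die example, the single multiplier $\mu$ of the Gibbs form $p_i = \frac{1}{Z} e^{\mu \alpha_i}$ can be found numerically. The sketch below is not the iterative scaling algorithm mentioned above; it swaps in plain bisection, which is enough here because there is only one constraint and the mean of the Gibbs distribution increases with $\mu$. The function names, the search interval $[-10, 10]$, and the iteration count are illustrative assumptions.

```python
import math

def gibbs(mu, alpha):
    """p_i = exp(mu * alpha_i) / Z, with Z the normalization factor."""
    weights = [math.exp(mu * a) for a in alpha]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, alpha):
    """The linear constraint value alpha^T p."""
    return sum(pi * a for pi, a in zip(p, alpha))

def solve_mu(alpha, beta, lo=-10.0, hi=10.0, iters=100):
    """Bisection on mu so that alpha^T p(mu) = beta.

    Relies on the mean being increasing in mu
    (its derivative is the variance of alpha under p).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid, alpha), alpha) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = [1, 2, 3, 4, 5, 6]            # face values of the die
mu_fair = solve_mu(alpha, 3.5)        # average of an unbiased die
mu_biased = solve_mu(alpha, 4.2)      # the beta = 4.2 case
p_biased = gibbs(mu_biased, alpha)
```

With $\beta = 3.5$ the recovered $\mu$ is essentially zero, giving the uniform distribution in agreement with Laplace's principle; with $\beta = 4.2$ the multiplier is positive and the probabilities increase with the face value.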
