Why Bayesian?

3. Bayes and Normal Models
Aleix M. Martinez
aleix@ece.osu.edu
Handouts for ECE 874, Sp 2007

If all our research in PR were to disappear and you could only save one theory, which one would you save? Bayesian theory is probably the most important one you should keep. It is simple, intuitive and optimal. Reverend Bayes (1763) and Laplace (1812) set the foundations of what we now know as Bayes theory.

State of nature: a sample usually corresponds to a state of nature (class); e.g., salmon and sea bass. The state of nature usually corresponds to a set of discrete categories (classes). Note that the continuous case also exists.

Priors: some classes might occur more often or might be more important, P(w_i).

Decision rule: we need a decision rule to help us determine to which class a testing vector belongs. Simplest (useless): C = arg max_i P(w_i).

Posterior probability: w_i is the class, x is the (observed) data. Obviously, we do not have P(w_i | x). But we can estimate p(x | w_i) and P(w_i).

Bayes Theorem (yes, the famous one):
P(w_i | x) = p(x | w_i) P(w_i) / p(x), where p(x) = sum_{j=1}^{c} p(x | w_j) P(w_j).

Bayes decision rule: w = arg max_i P(w_i | x). This rule guarantees the minimum probability of error.

Rev. Thomas Bayes (1702-1761): During his lifetime, Bayes was a defender of Isaac Newton's calculus, and developed several important results, of which the Bayes Theorem is his best known and, arguably, most elegant. This theorem and the subsequent development of Bayesian theory are among the most relevant topics in pattern recognition and have found applications in almost every corner of the scientific world. Bayes himself did not, however, provide the derivation of the Bayes Theorem as it is now known to us. Bayes developed the method for uniform priors. This result was later extended by Laplace and contemporaries. Nonetheless, Bayes is generally acknowledged as the first to have established a mathematical basis for probability inference.
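The theorem and decision rule above can be sketched in a few lines of code. A minimal Python sketch (the prior and likelihood values are illustrative assumptions, not from the slides):

```python
import numpy as np

# Bayes' theorem for two classes:
#   P(w_i | x) = p(x | w_i) P(w_i) / p(x),  p(x) = sum_j p(x | w_j) P(w_j)
priors = np.array([0.7, 0.3])         # P(w_1), P(w_2) -- assumed values
likelihoods = np.array([0.2, 0.6])    # p(x | w_1), p(x | w_2) at some observed x

evidence = np.sum(likelihoods * priors)       # p(x)
posteriors = likelihoods * priors / evidence  # P(w_i | x)

# Bayes decision rule: pick the class with the largest posterior.
decision = np.argmax(posteriors)
```

Note that the posteriors sum to one by construction, and the rule simply picks the largest one; here the larger likelihood of class 2 outweighs its smaller prior.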
Multiple random variables

To be mathematically precise, one should write p_x(x | w_i) instead of p(x | w_i), because this probability density function depends on a single random variable. In general there is no need for the distinction (e.g., p_x & p_y). Should this arise, we will use the above notation.

Loss function & decision risk

The loss function states exactly how costly each action is, and is used to convert a probability determination into a decision.
classes: {w_1, ..., w_c}
actions: {a_1, ..., a_a}
loss function: lambda(a_i | w_j), the cost (risk) of taking action a_i when the true class is w_j (see Appendix A.4).

Conditional risk: R(a_i | x) = sum_{j=1}^{c} lambda(a_i | w_j) P(w_j | x).

Bayes decision rule

Bayes decision rule (Bayes risk): choose a = arg min_i R(a_i | x). The resulting minimum overall risk is called the Bayes risk.

A simple example

Two-class classifier:
R(a_1 | x) = lambda_11 P(w_1 | x) + lambda_12 P(w_2 | x)
R(a_2 | x) = lambda_21 P(w_1 | x) + lambda_22 P(w_2 | x)
Decision rule: decide w_1 if R(a_1 | x) < R(a_2 | x), or (lambda_21 - lambda_11) P(w_1 | x) > (lambda_12 - lambda_22) P(w_2 | x).
Applying Bayes: decide w_1 if p(x | w_1) / p(x | w_2) > [(lambda_12 - lambda_22) / (lambda_21 - lambda_11)] * [P(w_2) / P(w_1)] = threshold.
Notation: lambda_ij = lambda(a_i | w_j).

Feature space: geometry

When x is in R^d, we have our d-dimensional feature space. Sometimes, this feature space is considered to be a Euclidean space; but as we'll see, many other alternatives exist. This allows for the study of PR problems from a geometric point of view. This is key to many algorithms.
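The conditional risk and the resulting minimum-risk decision can be sketched as follows (the loss matrix and posterior values are illustrative assumptions):

```python
import numpy as np

# lam[i, j] = lambda(a_i | w_j): cost of taking action a_i when the
# true class is w_j (made-up values; zero loss for correct decisions).
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.3, 0.7])   # P(w_1 | x), P(w_2 | x) -- assumed

# Conditional risk: R(a_i | x) = sum_j lam[i, j] P(w_j | x)
risks = lam @ posteriors

# Bayes decision rule: choose the action of minimum conditional risk.
action = np.argmin(risks)
```

With these numbers, R(a_1 | x) = 1.4 and R(a_2 | x) = 0.3, so the rule chooses action a_2 even though the losses are asymmetric.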
Discriminant functions

We can construct a set of discriminant functions g_i(x), i = 1, ..., c. We classify a feature vector x as w_i if: g_i(x) > g_j(x) for all j != i.

The Bayes classifier is: g_i(x) = -R(a_i | x). If errors are to be minimized, one needs to minimize the probability of error: the zero-one loss, lambda(a_i | w_j) = 0 if i = j and 1 otherwise, which gives minimum-error-rate classification.

If we use Bayes & minimum-error-rate classification, we get:
g_i(x) = P(w_i | x) = p(x | w_i) P(w_i) / sum_{j=1}^{c} p(x | w_j) P(w_j).
Since the denominator is constant across classes, we sometimes find it more convenient to write this equation as:
g_i(x) = ln p(x | w_i) + ln P(w_i).

Geometry (key point): the goal is to divide the feature space into c decision regions, R_1, ..., R_c. Key: the effect of any decision rule is to divide the feature space into decision regions. Symbolic representation (figure).

Classification is also known as hypothesis testing.

Other criteria

In some applications the priors are not known. In this case, we usually attempt to minimize the worst overall risk. Two approaches for that are the minimax and the Neyman-Pearson criteria.
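The log-discriminant g_i(x) = ln p(x | w_i) + ln P(w_i) can be sketched for 1-D Gaussian class conditionals (the means, variances, and priors below are illustrative assumptions):

```python
import math

# Log of a 1-D Gaussian density N(mu, var), evaluated at x.
def log_gauss(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# (mu, var, prior) per class -- assumed values.
params = [(0.0, 1.0, 0.5), (3.0, 1.0, 0.5)]

# Classify x as the class with the largest g_i(x) = ln p(x|w_i) + ln P(w_i).
def classify(x):
    g = [log_gauss(x, mu, var) + math.log(p) for mu, var, p in params]
    return max(range(len(g)), key=lambda i: g[i])
```

With equal priors and variances, the boundary falls halfway between the two means, so points near 0 go to class 0 and points near 3 go to class 1.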
Normal Distributions & Bayes

So far we have used p(x | w_i) and P(w_i) to specify the decision boundaries of a Bayes classifier. The Normal distribution is the most typical pdf for p(x | w_i). Recall the central limit theorem.

Central limit theorem (simplified): assume that the random variables x_1, ..., x_n are iid, each with finite mean and variance. When n -> infinity, the standardized random variable converges to a normal distribution. (See Stark & Woods pp. 5-30.)

Univariate case: the Gaussian distribution is
p(x) = (1 / (sqrt(2 pi) sigma)) exp(-(x - mu)^2 / (2 sigma^2)), x ~ N(mu, sigma^2).

Multivariate case (d > 1):
p(x) = (1 / ((2 pi)^(d/2) |Sigma|^(1/2))) exp(-(1/2) (x - mu)^T Sigma^(-1) (x - mu)), x ~ N(mu, Sigma),
where mu = E(x) = integral of x p(x) dx, and Sigma = E((x - mu)(x - mu)^T) = integral of (x - mu)(x - mu)^T p(x) dx.
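The multivariate density above translates directly to code. A minimal sketch (a naive implementation for small d; in practice one would use a library routine):

```python
import numpy as np

# Multivariate Normal density:
#   p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Peak of a standard 2-D Gaussian: p(mu) = 1 / (2*pi).
mu = np.zeros(2)
sigma = np.eye(2)
p0 = mvn_pdf(np.zeros(2), mu, sigma)
```

Evaluating at the mean gives the peak value 1/(2 pi) for the standard 2-D case, a quick sanity check on the normalization constant.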
Distances

The general distance in a space is given by: d(x, mu) = sqrt((x - mu)^T Sigma^(-1) (x - mu)), where Sigma is the covariance matrix of the distribution (or data). If Sigma = I, then the above equation becomes the Euclidean (norm) distance. If the distribution is Normal, this distance is called the Mahalanobis distance.

Example (2D Normals): heteroscedastic vs. homoscedastic.

Moments of the estimates

In statistics the estimates are generally known as the moments of the data. The first moment is the sample mean. The second is the sample autocorrelation matrix:
S = (1/n) sum_{i=1}^{n} x_i x_i^T.

Central moments

The variance and the covariance matrix are special cases, because they depend on the mean of the data, which is unknown. Usually we solve that by using the sample mean:
Sigma_hat = (1/n) sum_{i=1}^{n} (x_i - mu_hat)(x_i - mu_hat)^T.
This is the sample covariance matrix.

Whitening transformation

Recall, it is sometimes convenient to represent the data in a space where its sample covariance matrix equals the identity matrix, I: A^T Sigma A = I.

Linear transformations

An n-dimensional vector X can be transformed linearly to another, Y, as: Y = A^T X. The mean is then: M_Y = E(Y) = A^T M_X. The covariance: Sigma_Y = A^T Sigma_X A. The order of the distances in the transformed space is identical to the one in the original space.
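The sample moments and the Mahalanobis distance can be sketched as follows (the synthetic data and the mixing matrix are illustrative assumptions):

```python
import numpy as np

# Synthetic correlated 2-D data (made-up mixing matrix, fixed seed).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 1.0]])

mu_hat = X.mean(axis=0)              # first moment: sample mean
Xc = X - mu_hat
sigma_hat = Xc.T @ Xc / len(X)       # sample covariance matrix

# Mahalanobis distance: d(x, mu) = sqrt((x-mu)^T Sigma^{-1} (x-mu))
def mahalanobis(x, mu, sigma):
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

d0 = mahalanobis(mu_hat, mu_hat, sigma_hat)  # distance of the mean to itself
```

By construction the sample covariance is symmetric, and the Mahalanobis distance from the mean to itself is zero.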
Orthonormal transformation

Eigenanalysis: Sigma Phi = Phi Lambda.
Eigenvectors: Phi = [phi_1, ..., phi_p].
Eigenvalues: Lambda = diag(lambda_1, ..., lambda_p).
The transformation is then: Y = Phi^T X, and Sigma_Y = Phi^T Sigma_X Phi = Lambda (recall, Phi^T Phi = Phi Phi^T = I).

Whitening

To obtain a covariance matrix equal to the identity matrix, we can apply the orthonormal transformation first and then normalize the result with Lambda^(-1/2):
Y = Lambda^(-1/2) Phi^T X,
Sigma_Y = Lambda^(-1/2) Phi^T Sigma_X Phi Lambda^(-1/2) = Lambda^(-1/2) Lambda Lambda^(-1/2) = I.

Properties

Whitening transformations are not orthogonal transformations. Therefore, Euclidean distances are not preserved. After whitening, the covariance matrix is invariant to any orthogonal transformation Psi: Psi^T I Psi = I.

Simultaneous diagonalization

It is usually the case that two or more covariance matrices need to be diagonalized. Assume Sigma_1 and Sigma_2 are two covariance matrices. Our goal is to have: A^T Sigma_1 A = I and A^T Sigma_2 A = Lambda. Homework: find the algorithm.
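The whitening transformation Y = Lambda^(-1/2) Phi^T X can be sketched directly from an eigendecomposition (the synthetic data below is an illustrative assumption):

```python
import numpy as np

# Synthetic correlated data with a made-up mixing matrix and fixed seed.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
sigma = np.cov(X, rowvar=False)

# Eigenanalysis: Sigma Phi = Phi Lambda (eigh: symmetric matrices).
evals, Phi = np.linalg.eigh(sigma)

# Whitening matrix A = Phi Lambda^{-1/2}, so that A^T Sigma A = I.
W = Phi @ np.diag(evals ** -0.5)
Y = X @ W

sigma_Y = np.cov(Y, rowvar=False)   # should equal I up to round-off
```

Since the same sample covariance is used to build W, the whitened covariance equals the identity exactly (up to floating-point round-off); whitening new data drawn from the same source would only give a matrix close to I.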
Some advantages

Algorithms usually become much simpler after diagonalization or whitening. The general distance becomes a simple Euclidean distance. Whitened data is invariant to other orthogonal transformations. Some algorithms require whitening to have certain properties (we'll see this later in the course).

Discriminant functions for Normal pdfs

The discriminant function, g_i(x) = ln p(x | w_i) + ln P(w_i), for the Normal density N(mu_i, Sigma_i) is:
g_i(x) = -(1/2) (x - mu_i)^T Sigma_i^(-1) (x - mu_i) - (d/2) ln 2 pi - (1/2) ln |Sigma_i| + ln P(w_i).

Possible scenarios (or assumptions):
- Sometimes, we might be able to assume Sigma_i = sigma^2 I.
- A more general case is when all covariance matrices are identical, Sigma_i = Sigma; i.e., homoscedastic.
- The most complex case is when Sigma_i is arbitrary; that is, heteroscedastic.

Sigma_i = sigma^2 I: with equal priors, the Bayes boundary is a (d-1)-dimensional hyperplane perpendicular to the line that passes through both means.

Homoscedastic: g_i(x) = -(1/2) (x - mu_i)^T Sigma^(-1) (x - mu_i) + ln P(w_i); i.e., a squared Mahalanobis distance plus the log prior.
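The Gaussian discriminant function above can be sketched as follows (the class parameters are illustrative assumptions; the constant -(d/2) ln 2 pi is dropped since it is shared by every class):

```python
import numpy as np

# g_i(x) = -0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 0.5 ln|Sigma_i| + ln P(w_i)
def g(x, mu, sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two illustrative homoscedastic classes with equal priors.
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
sigma = np.eye(2)

x = np.array([1.0, 0.0])
label = 1 if g(x, mu1, sigma, 0.5) > g(x, mu2, sigma, 0.5) else 2
```

Since the covariances and priors are equal here, the decision reduces to comparing Mahalanobis (here, Euclidean) distances to the two means, and x = (1, 0) falls on the side of mu1.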
Heteroscedastic (Sigma_i arbitrary)

In the 2-class case, the decision surface is a hyperquadric (e.g., hyperplanes, hyperspheres, hyperhyperboloids, etc.). These decision boundaries may not be connected. Any hyperquadric can be given (represented) by two Gaussian distributions.

Project #1
1. Implement these three cases using Matlab (see pp. 36-4 for details). 2D and/or 3D plots.
2. Generalize the algorithm to more than two classes (Gaussians).
3. Simulate different Gaussians and distinct priors.

Bayes is optimal

P(error) = P(x in R_2, w_1) + P(x in R_1, w_2)
= integral over R_2 of p(x | w_1) P(w_1) dx + integral over R_1 of p(x | w_2) P(w_2) dx.
If our goal is to minimize the classification error, then Bayes is optimal (you cannot do better than Bayes, ever). In general, if p(x | w_1) P(w_1) > p(x | w_2) P(w_2), it is preferable to classify x in w_1, so that the smaller integral contributes to the error (see next slide). This is what Bayes does. There is no possible smaller error.
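The two-class error integral above can be checked empirically. A minimal Monte Carlo sketch in Python (a stand-in for the Matlab project; all parameters are illustrative assumptions):

```python
import numpy as np

# Two 1-D Gaussian classes with equal priors and equal variances (made up).
rng = np.random.default_rng(2)
mu1, mu2, sigma, prior1 = 0.0, 2.0, 1.0, 0.5

# Draw labeled samples from the mixture.
n = 200_000
labels = rng.random(n) < prior1              # True -> class w_1
x = np.where(labels,
             rng.normal(mu1, sigma, n),
             rng.normal(mu2, sigma, n))

# Equal priors and variances: Bayes decides w_1 when x < (mu1 + mu2) / 2.
decide_w1 = x < (mu1 + mu2) / 2
error = np.mean(decide_w1 != labels)
# For this setup the true Bayes error is Phi(-1), about 0.159.
```

The empirical error should land near the analytic value; any other decision threshold would make one of the two integrals, and hence the total error, larger.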
The multiclass case: how to calculate the error?

P(correct) = sum_{i=1}^{C} P(x in R_i, w_i) = sum_{i=1}^{C} integral over R_i of p(x | w_i) P(w_i) dx.
Bayes yields the smallest error. But which is the actual error? The above equation cannot be readily computed, because the regions R_i may be very complex.

Error bounds

Several approximations are easier to compute (usually upper bounds): the Chernoff bound and the Bhattacharyya bound (which assumes the pdfs are homoscedastic). These bounds can be applied to the 2-class case only.

For this, we need an integral equation that we can solve. For example,
P(error) = integral of P(error | x) p(x) dx, where
P(error | x) = P(w_1 | x) if we decide w_2, and P(w_2 | x) if we decide w_1.
Or, we can also write:
P(error) = integral of min[p(x | w_1) P(w_1), p(x | w_2) P(w_2)] dx.

Chernoff bound

Since it is known that min(a, b) <= a^s b^(1-s) for 0 <= s <= 1, we can now write:
P(error) <= P(w_1)^s P(w_2)^(1-s) * integral of p(x | w_1)^s p(x | w_2)^(1-s) dx.
If the class-conditional probabilities are normal, we can solve this analytically:
integral of p(x | w_1)^s p(x | w_2)^(1-s) dx = e^(-k(s)), where
k(s) = [s(1-s)/2] (mu_2 - mu_1)^T [s Sigma_1 + (1-s) Sigma_2]^(-1) (mu_2 - mu_1)
+ (1/2) ln( |s Sigma_1 + (1-s) Sigma_2| / (|Sigma_1|^s |Sigma_2|^(1-s)) ).

Bhattacharyya bound

When the data is homoscedastic, Sigma_1 = Sigma_2, the optimal solution is s = 1/2. This is the Bhattacharyya bound.

A tighter bound is the asymptotic nearest neighbor error, which is derived from:
error = integral of [p(x | w_1) P(w_1) p(x | w_2) P(w_2)] / [p(x | w_1) P(w_1) + p(x | w_2) P(w_2)] dx.
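The Chernoff exponent at s = 1/2 gives the Bhattacharyya bound, which is easy to compute in closed form for two Gaussians. A sketch (the class parameters in the example call are illustrative assumptions):

```python
import numpy as np

# Bhattacharyya bound for N(mu1, S1) vs N(mu2, S2):
#   P(error) <= sqrt(P(w1) P(w2)) * e^{-k(1/2)}, with
#   k(1/2) = (1/8) d^T [(S1+S2)/2]^{-1} d
#            + 0.5 ln( |(S1+S2)/2| / sqrt(|S1| |S2|) ),  d = mu2 - mu1.
def bhattacharyya_bound(mu1, S1, mu2, S2, p1=0.5):
    mu1, mu2 = np.atleast_1d(mu1), np.atleast_1d(mu2)
    S1, S2 = np.atleast_2d(S1), np.atleast_2d(S2)
    S = (S1 + S2) / 2
    diff = mu2 - mu1
    k = (diff @ np.linalg.inv(S) @ diff / 8
         + 0.5 * np.log(np.linalg.det(S)
                        / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(p1 * (1 - p1)) * np.exp(-k)

# Example: the 1-D pair N(0, 1) vs N(2, 1) with equal priors.
b = bhattacharyya_bound([0.0], [[1.0]], [2.0], [[1.0]])
```

For this pair the true Bayes error is about 0.159, and the bound evaluates to 0.5 e^(-0.5), roughly 0.303: a valid (if loose) upper bound.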
Closing Notes

Bayes is important because it minimizes the probability of error. In that sense we say it is optimal. Unfortunately, Bayes assumes that the conditional densities and priors are known (or can be estimated), which is not necessarily true. In general, not even the form of these probabilities is known. Most PR approaches attempt to solve these shortcomings. This is, in fact, what most of PR is all about.

On the + side: a simple example

We want to predict whether a student will pass a test or not. Y = 1 denotes pass, Y = 0 failure. The observation is a single random variable x which specifies the hours of study. Let P(Y = 1 | x) = c x. Then:
g(x) = 1 if P(Y = 1 | x) >= 1/2, i.e., x >= 1/(2c); g(x) = 0 otherwise.

Optional homework

Using Matlab, generate n observations of P(Y = 1 | X = x) and P(Y = 0 | X = x). Approximate each using a Gaussian distribution. Calculate the Bayes decision boundary and the classification error. Select several arbitrary values for c and see how well you can approximate them.

Hints: the pointwise error is min[P(Y = 1 | X = x), P(Y = 0 | X = x)]. Plot the original distributions to help you.
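The closing example can be simulated in a few lines. A Python sketch of the homework (the form P(Y=1 | x) = c x and the uniform study-hours model are reconstructed assumptions, since the original equation is garbled):

```python
import numpy as np

# Assumed model: P(Y=1 | x) = c * x for study hours x in [0, 1/c].
rng = np.random.default_rng(3)
c = 0.1
n = 100_000
x = rng.uniform(0, 1 / c, n)        # hours of study, uniform over [0, 1/c]
y = rng.random(n) < c * x           # Y=1 (pass) with probability c*x

# Bayes rule: predict pass when P(Y=1 | x) >= 1/2, i.e. x >= 1/(2c).
pred = x >= 1 / (2 * c)
error = np.mean(pred != y)
# Pointwise error is min[P(Y=1|x), P(Y=0|x)]; averaged over uniform x it is 1/4.
```

The empirical error should come out near 0.25, the average of min(cx, 1 - cx) over the uniform range; changing c rescales the x-axis but leaves this error unchanged.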