Bayes Decision Theory - II
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175, Winter 2012 - UCSD
Nearest Neighbor Classifier
We are considering supervised classification.
Nearest Neighbor (NN) Classifier:
A training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a vector of observations and $y_i$ is the corresponding class label.
A vector $x$ to classify.
The NN Decision Rule is: set $y = y_{i^*}$, where
$$i^* = \arg\min_{i \in \{1, \ldots, n\}} d(x, x_i).$$
argmin means: the $i$ that minimizes the distance.
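The NN decision rule can be sketched in a few lines. This is a minimal illustration, assuming a Euclidean metric for $d$ and a made-up toy training set; the names `euclidean`, `nn_classify` and `train` are hypothetical and not part of the course material.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nn_classify(train, x, dist=euclidean):
    """Return y_{i*}, the label of the training point closest to x."""
    i_star = min(range(len(train)), key=lambda i: dist(x, train[i][0]))
    return train[i_star][1]

# Toy training set (made-up numbers): (observation vector, class label)
train = [((0.0, 0.0), 0), ((1.0, 1.0), 0), ((4.0, 4.0), 1), ((5.0, 3.0), 1)]
label = nn_classify(train, (3.5, 3.0))
```

Passing a different `dist` function changes the metric without touching the rule itself, which is the dependence on the metric discussed next.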
Optimal Classifiers
We have seen that performance depends on the metric.
Some metrics are "better" than others.
The meaning of "better" is connected to how well adapted the metric is to the properties of the data.
But can we be more rigorous? What do we mean by optimal?
To talk about optimality we define a cost or loss:
$$x \;\rightarrow\; f(\cdot) \;\rightarrow\; \hat{y} = f(x), \qquad L(y, \hat{y}).$$
Loss is the function that we want to minimize.
Loss depends on the true $y$ and the prediction $\hat{y}$.
Loss tells us how good our predictor is.
Loss Functions & Classification Errors
Loss is a function of classification errors.
What errors can we have? Two types: false positives and false negatives.
Consider a face detection problem (decide "face" or "non-face"):
if you see a non-face and say "face", you have a false positive (false alarm);
if you see a face and say "non-face", you have a false negative (miss, failure to detect).
Obviously, we have corresponding sub-classes for non-errors: true positives and true negatives.
The positive/negative part reflects what we say or decide; the true/false part reflects whether that decision agrees with the true class label ("true state of the world").
(Conditional) Risk
To weigh different errors differently we introduce a loss function.
Denote the cost of classifying $X$ from class $i$ as $j$ by $L[i \rightarrow j]$.
One way to measure how good the classifier is, is to use the (data-conditional) expected value of the loss, aka the (conditional) Risk,
$$R(x, i) = E\{L[Y \rightarrow i] \mid x\} = \sum_j L[j \rightarrow i]\, P_{Y|X}(j \mid x).$$
Note that the (data-conditional) risk is a function of both the decision "decide class $i$" and the conditioning data (measured feature vector), $x$.
Loss Functions
Example: two snakes and eating poisonous dart frogs.
The regular snake will die if it eats a dart frog.
Dart frogs are a good snack for the predator dart-snake.
This leads to the losses:
Regular snake
  snake prediction | dart frog | regular frog
  regular          | $\infty$  | 0
  dart             | 0         | 10
Predator snake
  snake prediction | dart frog | regular frog
  regular          | 10        | 0
  dart             | 0         | 10
What is the optimal decision when the snakes find a frog like these?
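Minimum Risk Classification
We have seen that if both snakes have
$$P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$
then both say "regular".
However, if
$$P_{Y|X}(j \mid x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$
then the vulnerable snake says "dart" while the predator says "regular".
Its infinite loss for saying "regular" when the frog is a dart frog makes the vulnerable snake much more cautious!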
BDR = Minimizing Conditional Risk
Note that the definition of risk immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation $x$.
The Optimal Decision is the Bayes Decision Rule (BDR):
$$i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j \rightarrow i]\, P_{Y|X}(j \mid x).$$
The BDR yields the optimal (minimal) risk:
$$R^*(x) = R(x, i^*) = \min_i \sum_j L[j \rightarrow i]\, P_{Y|X}(j \mid x).$$
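As a rough sketch, the conditional risk and the BDR can be evaluated for the two-snake example, using an infinite loss for the regular snake that calls a dart frog "regular" and a loss of 10 for the other errors. The dictionary layout `loss[(true class, decision)]` and the function names are illustrative assumptions. The code also assumes the convention that a class with zero posterior contributes nothing to the risk, even when its loss is infinite.

```python
import math

def conditional_risk(loss, posterior, decision):
    """R(x, i) = sum_j L[j -> i] * P(j | x); zero-posterior terms are skipped
    (assumed convention 0 * inf = 0)."""
    return sum(loss[(j, decision)] * p for j, p in posterior.items() if p > 0)

def bayes_decision(loss, posterior):
    """Return the decision i* with minimal conditional risk."""
    return min(posterior, key=lambda i: conditional_risk(loss, posterior, i))

# loss[(true class j, decision i)] for each snake
regular_snake = {("dart", "regular"): math.inf, ("regular", "regular"): 0,
                 ("dart", "dart"): 0, ("regular", "dart"): 10}
predator_snake = {("dart", "regular"): 10, ("regular", "regular"): 0,
                  ("dart", "dart"): 0, ("regular", "dart"): 10}

certain = {"dart": 0.0, "regular": 1.0}   # posterior P(j | x), first case
unsure = {"dart": 0.1, "regular": 0.9}    # posterior P(j | x), second case

decisions = {
    "regular snake, certain": bayes_decision(regular_snake, certain),
    "predator snake, certain": bayes_decision(predator_snake, certain),
    "regular snake, unsure": bayes_decision(regular_snake, unsure),
    "predator snake, unsure": bayes_decision(predator_snake, unsure),
}
```

With the second posterior the predator's risks are 1 for "regular" and 9 for "dart", while the vulnerable snake's risk for "regular" is infinite, so only the vulnerable snake switches to "dart".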
What is a Decision Rule?
Consider the c-ary classification problem with class labels $\mathcal{C} = \{1, \ldots, c\}$.
Given an observation (feature), $x$, to be classified, a decision rule is a function $d = d(\cdot)$ of the observation that takes its values in the set of class labels,
$$d(x) \in \{1, \ldots, c\}.$$
Note that $d^*(x) = i^*(x)$, defined on the previous slide, is an optimal decision rule in the sense that for a specific value of $x$ it minimizes the conditional risk $R(x, i)$ over all possible decisions $i \in \mathcal{C}$.
(d-dependent) Total Average Risk
Given a decision rule $d$ and the conditional risk $R(x, i)$, we can consider the (d-dependent) conditional risk $R(x, d(x))$.
We can now define the total (d-dependent) Expected or Average Risk (aka d-risk):
$$R(d) = E\{R(x, d(x))\}.$$
Note that we have averaged over all possible measurements (features) $x$ that we might encounter in the world.
Note that $R(d)$ is a function of a function! (A function of $d$.)
The d-risk $R(d)$ is a measure of how we expect to perform on the average when we use the fixed decision rule $d$ over and over again on a large set of real-world data.
It is natural to ask if there is an optimal decision rule which minimizes the average risk $R(d)$ over the class of all possible decision rules.
Minimizing the Average Risk R(d)
Optimizing the total risk $R(d)$ seems hard because we are trying to minimize it over a family of functions (decision rules), $d$.
However, since
$$R(d) = E\{R(x, d(x))\} = \int R(x, d(x))\, p_X(x)\, dx, \qquad p_X(x) \ge 0,$$
one can equivalently minimize the data-conditional risk $R(x, d(x))$ point-wise in $x$.
I.e. solve for the value of the optimal decision rule at each $x$:
$$d^*(x) = \arg\min_{d(x)} R(x, d(x)) = \arg\min_i R(x, i).$$
Thus $d^*(x) = i^*(x)$!! I.e. the BDR, which we already know optimizes the data-conditional risk, ALSO optimizes the average risk $R(d)$ over ALL possible decision rules $d$!!
This makes sense: if the BDR is optimal for every single situation, $x$, it must be optimal on the average over all $x$.
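The claim that point-wise minimization also minimizes the average risk can be checked by brute force on a small discrete problem, where the integral becomes a sum over $x$. Everything in this sketch (the three observation values, the probabilities and the loss matrix) is made up for illustration.

```python
from itertools import product

# Hypothetical discrete problem: 3 observation values, 2 classes.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}                               # P_X(x)
posterior = {"a": (0.9, 0.1), "b": (0.4, 0.6), "c": (0.2, 0.8)}    # P(j | x), j = 0, 1
loss = [[0, 1], [5, 0]]                                            # loss[j][i] = L[j -> i]

def cond_risk(x, i):
    """Conditional risk R(x, i) = sum_j L[j -> i] P(j | x)."""
    return sum(loss[j][i] * posterior[x][j] for j in (0, 1))

def avg_risk(rule):
    """Average risk R(d) = sum_x P_X(x) R(x, d(x)) for a rule given as a dict."""
    return sum(p_x[x] * cond_risk(x, rule[x]) for x in p_x)

# BDR: minimize the conditional risk separately at each x
bdr = {x: min((0, 1), key=lambda i: cond_risk(x, i)) for x in p_x}

# Enumerate all 2^3 decision rules and pick the one with the smallest average risk
all_rules = [dict(zip(p_x, choice)) for choice in product((0, 1), repeat=len(p_x))]
best = min(all_rules, key=avg_risk)
```

In this toy problem `best` coincides with `bdr`, and no enumerated rule has a lower average risk.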
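The 0/1 Loss Function
An important special case of interest: zero loss for no error and equal loss for the two error types.
This is equivalent to the "zero/one" loss:
$$L[i \rightarrow j] = \begin{cases} 0, & i = j \\ 1, & i \ne j \end{cases}$$
  snake prediction | dart frog | regular frog
  regular          | 1         | 0
  dart             | 0         | 1
Under this loss the optimal Bayes decision rule (BDR) is
$$d^*(x) = i^*(x) = \arg\min_i \sum_j L[j \rightarrow i]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j \mid x).$$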
0/1 Loss yields MAP Decision Rule
Note that:
$$i^*(x) = \arg\min_i \sum_{j \ne i} P_{Y|X}(j \mid x) = \arg\min_i \left[1 - P_{Y|X}(i \mid x)\right] = \arg\max_i P_{Y|X}(i \mid x).$$
Thus the Optimal Decision for the 0/1 loss is: pick the class that is most probable given the observation $x$.
$i^*(x)$ is known as the Maximum a Posteriori Probability (MAP) solution.
This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss.
We will often simplify our discussion by assuming this loss.
But you should always be aware that other losses may be used.
BDR for the 0/1 Loss
Consider the evaluation of the BDR for 0/1 loss,
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x).$$
This is also called the Maximum a Posteriori Probability (MAP) rule.
It is usually not trivial to evaluate the posterior probabilities $P_{Y|X}(i \mid x)$.
This is due to the fact that we are trying to infer the cause (class $i$) from the consequence (observation $x$), i.e. we are trying to solve a nontrivial inverse problem.
E.g. imagine that I want to evaluate $P_{Y|X}(\text{person} \mid \text{"has two eyes"})$.
This strongly depends on what the other classes are.
Posterior Probabilities and Detection
If the two classes are "people" and "cars", then $P_{Y|X}(\text{person} \mid \text{"has two eyes"}) = 1$.
But if the classes are "people" and "cats", then $P_{Y|X}(\text{person} \mid \text{"has two eyes"}) = 1/2$ if there are equal numbers of cats and people to uniformly choose from [this is additional info!].
How do we deal with this problem?
We note that it is much easier to infer consequence from cause.
E.g., it is easy to infer that $P_{X|Y}(\text{"has two eyes"} \mid \text{person}) = 1$.
This does not depend on any other classes.
We do not need any additional information.
Given a class, just count the frequency of the observation.
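Bayes Rule
How do we go from $P_{X|Y}(x \mid j)$ to $P_{Y|X}(j \mid x)$? We use Bayes rule:
$$P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)}.$$
Consider the two-class problem, i.e. $Y = 0$ or $Y = 1$. The BDR under 0/1 loss is
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x) = \begin{cases} 0, & \text{if } P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x) \\ 1, & \text{if } P_{Y|X}(0 \mid x) < P_{Y|X}(1 \mid x) \end{cases}$$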
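A small sketch of Bayes rule applied to the "has two eyes" example may help. It assumes equal class priors in both scenarios and likelihoods of 1 for people and cats and 0 for cars; the function name `posterior` and the dictionary layout are illustrative choices.

```python
def posterior(likelihood, prior):
    """Bayes rule: P(i | x) = P(x | i) P(i) / sum_k P(x | k) P(k)."""
    joint = {i: likelihood[i] * prior[i] for i in prior}
    evidence = sum(joint.values())          # P_X(x)
    return {i: joint[i] / evidence for i in joint}

# Likelihoods P("has two eyes" | class); priors assumed equal
people_vs_cars = posterior({"person": 1.0, "car": 0.0}, {"person": 0.5, "car": 0.5})
people_vs_cats = posterior({"person": 1.0, "cat": 1.0}, {"person": 0.5, "cat": 0.5})
```

The likelihood for "person" is the same in both calls; only the competing class changes, and that alone moves the posterior for "person" from 1 to 1/2.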
BDR for 0/1 Loss Binary Classification
Pick "0" when $P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x)$ and "1" otherwise.
Using Bayes rule on both sides of this inequality yields
$$P_{Y|X}(0 \mid x) \ge P_{Y|X}(1 \mid x) \;\Leftrightarrow\; \frac{P_{X|Y}(x \mid 0)\, P_Y(0)}{P_X(x)} \ge \frac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_X(x)}.$$
Noting that $P_X(x)$ is a non-negative quantity, this is the same as the rule: pick "0" when
$$P_{X|Y}(x \mid 0)\, P_Y(0) \ge P_{X|Y}(x \mid 1)\, P_Y(1),$$
i.e.
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i).$$
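The "Log Trick"
Sometimes it is not convenient to work directly with pdfs.
One helpful trick is to take logs.
Note that the log is a monotonically increasing function,
$$a > b \;\Leftrightarrow\; \log a > \log b,$$
from which we have
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i) = \arg\max_i \log\left[P_{X|Y}(x \mid i)\, P_Y(i)\right] = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right] = \arg\min_i \left[-\log P_{X|Y}(x \mid i) - \log P_Y(i)\right].$$
[Figure: plot of the log function marking $a$, $b$, $\log a$ and $\log b$]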
Standard (0/1) BDR
In summary, for the zero/one loss, the following three decision rules are optimal and equivalent:
1) $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$
2) $i^*(x) = \arg\max_i \left[P_{X|Y}(x \mid i)\, P_Y(i)\right]$
3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$
Form 1) is usually hardest to use; 3) is frequently easier than 2).
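The equivalence of the three forms can be checked numerically. The sketch below assumes a hypothetical 1-D problem with two unit-variance Gaussian classes and unequal priors; the model, the numbers and the function names are all made up for illustration.

```python
import math

# Hypothetical 1-D model: two unit-variance Gaussian classes
means = {0: 0.0, 1: 2.0}
priors = {0: 0.7, 1: 0.3}

def likelihood(x, i):
    """Class-conditional density P(x | i) of the assumed Gaussian model."""
    return math.exp(-0.5 * (x - means[i]) ** 2) / math.sqrt(2 * math.pi)

def rule_posterior(x):
    """Form 1: maximize the posterior P(i | x)."""
    evidence = sum(likelihood(x, k) * priors[k] for k in means)
    return max(means, key=lambda i: likelihood(x, i) * priors[i] / evidence)

def rule_joint(x):
    """Form 2: maximize P(x | i) P(i); the evidence P(x) is not needed."""
    return max(means, key=lambda i: likelihood(x, i) * priors[i])

def rule_log(x):
    """Form 3: maximize log P(x | i) + log P(i)."""
    return max(means, key=lambda i: math.log(likelihood(x, i)) + math.log(priors[i]))

test_points = [-1.0, 0.5, 1.0, 1.6, 3.0]
decisions = [(rule_posterior(x), rule_joint(x), rule_log(x)) for x in test_points]
```

Each tuple in `decisions` holds the same label three times. Form 2 skips the normalization by $P_X(x)$ that form 1 requires, and form 3 turns the product into a sum.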
(Standard 0/1-Loss) BDR - Example
So far the BDR is an abstract rule.
How does one implement the optimal decision in practice?
In addition to having a loss function, you need to know, model, or estimate the probabilities!
Example:
Suppose that you run a gas station.
On Mondays you have a promotion to sell more gas.
Q: is the promotion working? I.e., is $Y = 0$ (no) or $Y = 1$ (yes)?
A good observation to answer this question is the interarrival time ($t$) between cars:
high $t$: not working ($Y = 0$);
low $t$: working well ($Y = 1$).
BDR - Example
What are the class-conditional and prior probabilities?
Model the probability of arrival of a car by an Exponential density (a standard pdf to use).
Continuous-valued interarrival times are assumed to be exponentially distributed. Hence
$$P_{X|Y}(t \mid i) = \lambda_i e^{-\lambda_i t},$$
where $\lambda_i$ is the arrival rate (cars/s).
The expected value of the interarrival time is
$$E_{X|Y}[x \mid y = i] = \frac{1}{\lambda_i}.$$
Consecutive times are assumed to be independent:
$$P_{X_1, \ldots, X_n | Y}(t_1, \ldots, t_n \mid i) = \prod_{k=1}^{n} P_{X|Y}(t_k \mid i) = \prod_{k=1}^{n} \lambda_i e^{-\lambda_i t_k}.$$
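BDR - Example
Let's assume that we know $\lambda_i$ and the (prior) class probabilities $P_Y(i) = \pi_i$, $i = 0, 1$.
We have measured a collection of times during the day, $D = \{t_1, \ldots, t_n\}$.
The probabilities are of exponential form, so it is easier to use the log-based BDR:
$$i^*(D) = \arg\max_i \left[\log P_{X|Y}(D \mid i) + \log P_Y(i)\right]$$
$$= \arg\max_i \left[\sum_{k=1}^{n} \log\left(\lambda_i e^{-\lambda_i t_k}\right) + \log \pi_i\right]$$
$$= \arg\max_i \left[\sum_{k=1}^{n} \left(-\lambda_i t_k + \log \lambda_i\right) + \log \pi_i\right]$$
$$= \arg\max_i \left[-\lambda_i \sum_{k=1}^{n} t_k + n \log \lambda_i + \log \pi_i\right].$$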
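The final expression, $-\lambda_i \sum_k t_k + n \log \lambda_i + \log \pi_i$, translates directly into code. This is a minimal sketch; the rates, priors and interarrival times below are hypothetical numbers chosen only to exercise both outcomes, and the function names are illustrative.

```python
import math

def log_score(times, lam, prior):
    """-lam * sum(t_k) + n * log(lam) + log(prior): the quantity the BDR maximizes."""
    return -lam * sum(times) + len(times) * math.log(lam) + math.log(prior)

def promotion_class(times, lam0, lam1, prior0=0.5, prior1=0.5):
    """Return 0 (not working) when class 0 has the strictly larger score, else 1."""
    s0 = log_score(times, lam0, prior0)
    s1 = log_score(times, lam1, prior1)
    return 0 if s0 > s1 else 1

# Hypothetical rates (cars/s) and measured interarrival times (s)
lam0, lam1 = 1.0, 2.0
short_gaps = [0.4, 0.5, 0.3, 0.6]
long_gaps = [1.1, 0.9, 1.4, 0.8]

decision_short = promotion_class(short_gaps, lam0, lam1)
decision_long = promotion_class(long_gaps, lam0, lam1)
```

Short gaps between cars favor the higher rate $\lambda_1$ (class 1), long gaps favor $\lambda_0$ (class 0).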
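BDR - Example
This means we pick "0" when
$$-\lambda_0 \sum_{k=1}^{n} t_k + n \log \lambda_0 + \log \pi_0 > -\lambda_1 \sum_{k=1}^{n} t_k + n \log \lambda_1 + \log \pi_1,$$
or
$$(\lambda_1 - \lambda_0) \sum_{k=1}^{n} t_k > n \log \frac{\lambda_1}{\lambda_0} + \log \frac{\pi_1}{\pi_0},$$
or (reasonably taking $\lambda_1 > \lambda_0$)
$$\frac{1}{n} \sum_{k=1}^{n} t_k > \frac{1}{\lambda_1 - \lambda_0} \left[\log \frac{\lambda_1}{\lambda_0} + \frac{1}{n} \log \frac{\pi_1}{\pi_0}\right],$$
and "1" otherwise.
Does this decision rule make sense?
Let's assume, for simplicity, that $\pi_0 = \pi_1 = 1/2$.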
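BDR - Example
For $\pi_0 = \pi_1 = 1/2$, we pick "promotion did not work" ($Y = 0$) if
$$\bar{t} = \frac{1}{n} \sum_{k=1}^{n} t_k > \frac{1}{\lambda_1 - \lambda_0} \log \frac{\lambda_1}{\lambda_0}.$$
The left-hand side is the (sample) average interarrival time for the day.
This means that there is an optimal choice of a threshold
$$T = \frac{1}{\lambda_1 - \lambda_0} \log \frac{\lambda_1}{\lambda_0}$$
above which we say "promotion did not work". This makes sense!
What is the shape of this threshold? Assuming $\lambda_0 = 1$, it looks like this:
[Figure: threshold $T$ plotted against $\lambda_1$]
The higher $\lambda_1$, the more likely we are to say "promotion did not work".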
BDR - Example
When $\pi_0 = \pi_1 = 1/2$, we pick "did not work" ($Y = 0$) when
$$\bar{t} = \frac{1}{n} \sum_{k=1}^{n} t_k > T, \qquad T = \frac{1}{\lambda_1 - \lambda_0} \log \frac{\lambda_1}{\lambda_0}.$$
[Figure: threshold $T$ plotted against $\lambda_1$]
Assuming $\lambda_0 = 1$, $T$ decreases with $\lambda_1$.
I.e. for a given daily average, the larger $\lambda_1$, the easier it is to say "did not work".
This means that as the expected rate of arrival for good days increases, we are going to impose a tougher standard on the average measured interarrival times.
The average has to be smaller for us to accept the day as a good one.
Once again, this makes sense!
A sensible answer is usually the case with the BDR (a good way to check your math).
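The threshold form of the rule, and the claim that $T$ shrinks as $\lambda_1$ grows, can be checked with a short sketch. It assumes equal priors and $\lambda_1 > \lambda_0$, as in the text; the specific rates, times and function names are hypothetical.

```python
import math

def threshold(lam0, lam1):
    """T = log(lam1 / lam0) / (lam1 - lam0); assumes equal priors and lam1 > lam0."""
    return math.log(lam1 / lam0) / (lam1 - lam0)

def decide(times, lam0, lam1):
    """Return 0 (did not work) if the sample mean exceeds T, else 1."""
    mean_t = sum(times) / len(times)
    return 0 if mean_t > threshold(lam0, lam1) else 1

# Threshold for lam0 = 1 and increasing (made-up) values of lam1
thresholds = [threshold(1.0, lam1) for lam1 in (1.5, 2.0, 4.0, 8.0)]

good_day = decide([0.4, 0.5, 0.3, 0.6], 1.0, 2.0)   # sample mean 0.45 s
bad_day = decide([1.1, 0.9, 1.4, 0.8], 1.0, 2.0)    # sample mean 1.05 s
```

The values in `thresholds` decrease monotonically, matching the statement that a larger $\lambda_1$ imposes a tougher standard on the average interarrival time.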
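The Gaussian Classifier
One important case is that of multivariate Gaussian classes.
The pdf of class $i$ is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:
$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}.$$
The BDR is
$$i^*(x) = \arg\max_i \left[-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \log\left((2\pi)^d |\Sigma_i|\right) + \log P_Y(i)\right].$$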
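Implementation of a Gaussian Classifier
To design a Gaussian classifier (e.g. homework):
Start from a collection of datasets, where the $i$-th class dataset $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n^{(i)}}^{(i)}\}$ is a set of $n^{(i)}$ examples from class $i$.
For each class estimate the Gaussian parameters:
$$\hat{\mu}_i = \frac{1}{n^{(i)}} \sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n^{(i)}} \sum_j \left(x_j^{(i)} - \hat{\mu}_i\right)\left(x_j^{(i)} - \hat{\mu}_i\right)^T, \qquad \hat{P}_Y(i) = \frac{n^{(i)}}{T},$$
where $T = \sum_{k=1}^{c} n^{(k)}$ is the total number of examples over all $c$ classes.
Via the "plug-in" rule, the BDR is approximated as
$$i^*(x) = \arg\max_i \left[-\frac{1}{2} (x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i) - \frac{1}{2} \log\left((2\pi)^d |\hat{\Sigma}_i|\right) + \log \hat{P}_Y(i)\right].$$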
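The plug-in procedure can be sketched as follows. To stay dependency-free, the sketch is restricted to $d = 2$ so the covariance inverse and determinant have closed forms; that restriction, the toy data and all function names are assumptions made for illustration, not the homework solution.

```python
import math

def fit_gaussian_2d(points):
    """Sample mean and covariance (dividing by n) for a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def log_discriminant(x, mean, cov, prior):
    """-1/2 Mahalanobis^2 - 1/2 log((2 pi)^d |cov|) + log prior, for d = 2."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form using the closed-form inverse of a symmetric 2x2 matrix
    maha2 = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return -0.5 * maha2 - 0.5 * math.log((2 * math.pi) ** 2 * det) + math.log(prior)

def train(datasets):
    """Estimate (mean, covariance, prior) per class; prior = n_i / total count."""
    total = sum(len(pts) for pts in datasets.values())
    return {i: fit_gaussian_2d(pts) + (len(pts) / total,) for i, pts in datasets.items()}

def classify(x, model):
    """Plug-in BDR: pick the class with the largest log discriminant."""
    return max(model, key=lambda i: log_discriminant(x, *model[i]))

# Toy per-class datasets (made-up numbers)
datasets = {
    0: [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (-0.5, 0.2), (0.2, -0.6)],
    1: [(4.0, 4.0), (5.0, 4.5), (4.5, 5.2), (3.6, 4.4), (4.3, 3.5)],
}
model = train(datasets)
```

For general $d$ the same structure applies, with a linear-algebra routine replacing the hand-written 2x2 inverse and determinant.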
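Gaussian Classifier
The Gaussian classifier can be written as
$$i^*(x) = \arg\min_i \left[d_i^2(x, \mu_i) + \alpha_i\right]$$
with
$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = \log\left((2\pi)^d |\Sigma_i|\right) - 2 \log P_Y(i),$$
and can be seen as a nearest "class-neighbor" classifier with a "funny metric".
Each class has its own distance measure: compute the squared Mahalanobis distance for that class, then add the constant $\alpha_i$.
We effectively have different metrics in the data (feature) space that are class dependent.
[Figure: decision boundary, where the posterior probability equals 0.5]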
Gaussian Classifier
A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
$$i^*(x) = \arg\min_i \left[d^2(x, \mu_i) + \alpha_i\right]$$
with
$$d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2 \log P_Y(i).$$
Note that:
$\alpha_i$ can be dropped when all classes have equal prior probability.
This is reminiscent of the NN classifier with Mahalanobis distance.
Instead of finding the nearest data point neighbor of $x$, it looks for the nearest class "prototype" (or "archetype", or "exemplar", or "template", or "representative", or "ideal", or "form"), defined as the class mean $\mu_i$.
[Figure: decision boundary, where the posterior probability equals 0.5]
Binary Classifier Special Case
Consider $\Sigma_i = \Sigma$ with two classes.
One important property of this case is that the decision boundary is a hyperplane (Homework).
This can be shown by computing the set of points $x$ such that
$$d^2(x, \mu_0) + \alpha_0 = d^2(x, \mu_1) + \alpha_1$$
and showing that they satisfy
$$w^T (x - x_0) = 0.$$
This is the equation of a hyperplane with normal $w$.
$x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case $w$ and $x_0$ are parallel.
[Figure: hyperplane with normal $w$ passing through $x_0$, drawn in axes $x_1, x_2, x_3$; the posterior probability equals 0.5 on the boundary]
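One possible route to the hyperplane, offered as a sketch rather than the intended homework solution, is to expand both squared distances with the shared covariance. The quadratic term $x^T \Sigma^{-1} x$ cancels. The symbol $b$ and the sign convention chosen for $w$ are introduced here for illustration only.

```latex
\begin{align*}
d^2(x,\mu_i) &= x^T\Sigma^{-1}x - 2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i \\
d^2(x,\mu_0) + \alpha_0 = d^2(x,\mu_1) + \alpha_1
  &\;\Leftrightarrow\;
  2(\mu_1-\mu_0)^T\Sigma^{-1}x
  = \mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0 \\
  &\;\Leftrightarrow\; w^T x = b,
  \qquad w = \Sigma^{-1}(\mu_1-\mu_0),
  \qquad b = \tfrac{1}{2}\left[\mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 + \alpha_1 - \alpha_0\right] \\
  &\;\Leftrightarrow\; w^T(x - x_0) = 0
  \quad\text{for any } x_0 \text{ with } w^T x_0 = b,
  \quad\text{e.g. the minimum-norm choice } x_0 = \frac{b}{\|w\|^2}\, w .
\end{align*}
```

The minimum-norm $x_0$ is a scalar multiple of $w$, which is why the two vectors are parallel in that case.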
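Gaussian M-ary Classifier Special Case
If all the class covariances are the identity, $\Sigma_i = I$, then
$$i^*(x) = \arg\min_i \left[d^2(x, \mu_i) + \alpha_i\right]$$
with
$$d^2(x, y) = \|x - y\|^2, \qquad \alpha_i = -2 \log P_Y(i).$$
This is called (simple, Cartesian) template matching with class means as templates.
E.g. for digit classification:
[Figure: digit templates and a query image, $i^* = ?$]
Compare the complexity of this classifier to NN classifiers!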
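Template matching with class means reduces to a few lines. The sketch below assumes identity covariances and uses two made-up 2-D templates; `template_match` and the numbers are hypothetical.

```python
import math

def template_match(x, means, priors):
    """argmin_i ||x - mu_i||^2 - 2 log P_Y(i), assuming identity covariances."""
    def cost(i):
        d2 = sum((xk - mk) ** 2 for xk, mk in zip(x, means[i]))
        return d2 - 2 * math.log(priors[i])
    return min(means, key=cost)

# Two hypothetical class templates (means) in 2-D
means = {0: (0.0, 0.0), 1: (2.0, 0.0)}

equal_priors = template_match((0.9, 0.0), means, {0: 0.5, 1: 0.5})
skewed_priors = template_match((0.9, 0.0), means, {0: 0.1, 1: 0.9})
```

The classifier stores one template per class and computes one distance per class, whereas an NN classifier stores and compares against every training point. With equal priors the point goes to the nearer mean; a strongly skewed prior can override a small difference in distance.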
The Sigmoid Function
We have derived much of the above from the log-based BDR
$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right].$$
When there are only two classes, $i = 0, 1$, it is also interesting to manipulate the original definition as follows:
$$i^*(x) = \arg\max_i g_i(x),$$
where
$$g_i(x) = P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_{X|Y}(x \mid 0)\, P_Y(0) + P_{X|Y}(x \mid 1)\, P_Y(1)}.$$
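The Sigmoid Function
Note that this can be written as
$$i^*(x) = \arg\max_i g_i(x), \qquad g_1(x) = 1 - g_0(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_{X|Y}(x \mid 0)\, P_Y(0)}}.$$
For Gaussian classes, the posterior probabilities are
$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) - d_1^2(x, \mu_1) + \alpha_0 - \alpha_1\right]\right\}},$$
where, as before,
$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = \log\left((2\pi)^d |\Sigma_i|\right) - 2 \log P_Y(i).$$
The factor $\frac{1}{2}$ in the exponent follows from these definitions of $d_i^2$ and $\alpha_i$, which are $-2$ times the corresponding terms of the log-based BDR.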
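The Sigmoid ("S-shaped") Function
The posterior probability for class $i = 0$,
$$g_0(x) = \frac{1}{1 + \exp\left\{\frac{1}{2}\left[d_0^2(x, \mu_0) - d_1^2(x, \mu_1) + \alpha_0 - \alpha_1\right]\right\}},$$
is a sigmoid and looks like this:
[Figure: sigmoid-shaped posterior, with the decision boundary where the posterior probability equals 0.5]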
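The sigmoid form of the posterior can be compared numerically against a direct application of Bayes rule. The sketch assumes 1-D Gaussian classes ($d = 1$) with made-up parameters and uses the exponent $\frac{1}{2}[d_0^2 - d_1^2 + \alpha_0 - \alpha_1]$ with $d_i^2 = (x - \mu_i)^2 / \sigma_i^2$ and $\alpha_i = \log(2\pi\sigma_i^2) - 2 \log P_Y(i)$; all names are illustrative.

```python
import math

def gauss_pdf(x, mu, var):
    """1-D Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def g0_bayes(x, mu, var, prior):
    """Posterior of class 0 computed directly from Bayes rule."""
    joint = [gauss_pdf(x, mu[i], var[i]) * prior[i] for i in (0, 1)]
    return joint[0] / (joint[0] + joint[1])

def g0_sigmoid(x, mu, var, prior):
    """Same posterior written as 1 / (1 + exp(0.5 * [d0^2 - d1^2 + a0 - a1]))."""
    d2 = [(x - mu[i]) ** 2 / var[i] for i in (0, 1)]
    alpha = [math.log(2 * math.pi * var[i]) - 2 * math.log(prior[i]) for i in (0, 1)]
    return 1.0 / (1.0 + math.exp(0.5 * (d2[0] - d2[1] + alpha[0] - alpha[1])))

# Hypothetical parameters: equal variances and equal priors
mu, var, prior = (0.0, 2.0), (1.0, 1.0), (0.5, 0.5)
xs = [-2.0, 0.0, 1.0, 2.5, 4.0]
pairs = [(g0_bayes(x, mu, var, prior), g0_sigmoid(x, mu, var, prior)) for x in xs]
```

With equal variances the exponent is linear in $x$, so $g_0$ is a logistic function of $x$ that equals 0.5 at the midpoint between the two means.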
The Sigmoid Function in Neural Nets
The sigmoid appears in neural networks, where it can be interpreted as a posterior probability for a Gaussian binary classification problem when the covariances are the same.
The Sigmoid Function in Neural Nets
But not necessarily when the covariances are different.
END