Theory and Applications of Pattern Recognition, 2003, Robi Polikar, Rowan University, Glassboro, NJ
Lecture 4: Bayes Classification Rule
Dept. of Electrical and Computer Engineering, 0909.40.0 / 0909.504.04 Theory & Applications of Pattern Recognition
Topics: a priori / a posteriori probabilities; the loss function; the Bayes decision rule; the likelihood ratio test; the maximum a posteriori (MAP) criterion; minimum error rate classification; discriminant functions; error bounds and probabilities.
Background: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001. PR logo Copyright Robi Polikar.
Today in PR: review of Bayes theorem; Bayes decision theory (Bayes rule, loss function & expected loss); minimum error rate classification; classification using discriminant functions; error bounds & probabilities.
Bayes Rule
Suppose we know P(ω_1), P(ω_2), P(x|ω_1) and P(x|ω_2), and that we have observed the value of the feature (a random variable) x. How would you decide on the state of nature (the type of fish) based on this information? Bayes theorem allows us to compute the posterior probabilities from the prior and class-conditional probabilities:

P(ω_i | x) = P(x ∩ ω_i) / P(x) = P(x | ω_i) P(ω_i) / P(x) = P(x | ω_i) P(ω_i) / Σ_{k=1}^{C} P(x | ω_k) P(ω_k)

Likelihood P(x | ω_i): the (class-conditional) probability of observing the feature value x, given that the correct class is ω_i. All things being equal, the category with the higher class-conditional probability is more likely to be the correct class.
Posterior probability P(ω_i | x): the (conditional) probability of the correct class being ω_i, given that feature value x has been observed.
Prior probability P(ω_i): the total probability of the correct class being ω_i, determined from prior experience.
Evidence P(x): the total probability of observing the feature value x.
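The rule above can be sketched numerically. This is a minimal illustration, not part of the original slides; all probability values are made up for the example.

```python
def posterior(likelihoods, priors):
    """Bayes rule: return P(w_i | x) for each class, given the
    class-conditional likelihoods P(x | w_i) and the priors P(w_i)."""
    # Evidence P(x) = sum_k P(x | w_k) P(w_k)
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical two-class example:
# P(x | w1) = 0.6, P(x | w2) = 0.2 ; P(w1) = 0.3, P(w2) = 0.7
post = posterior([0.6, 0.2], [0.3, 0.7])
```

Note that the posteriors always sum to one, since the evidence is exactly the normalizing factor.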
Bayes Decision Rule
Choose ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, i, j = 1, 2, …, c.
If there are multiple features, x = {x_1, x_2, …, x_d}:
Choose ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, j = 1, 2, …, c.
The Loss Function
A mathematical description of how costly each action (making a class decision) is. Are certain mistakes costlier than others?
{ω_1, ω_2, …, ω_c}: the set of states of nature (classes).
{α_1, α_2, …, α_a}: the set of possible actions. Note that a need not equal c, because we may have more (or fewer) actions than classes; for example, not making a decision (rejection) is also an action.
λ(α_i | ω_j): the loss function; the loss incurred by taking action α_i when the true state of nature is in fact ω_j.
R(α_i | x): the conditional risk, the expected loss for taking action α_i:
R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)
The Bayes decision takes the action that minimizes the conditional risk!
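As a small numeric sketch of the conditional risk (not from the slides; the loss matrix below, including a hypothetical rejection action with a fixed loss of 0.25, is invented for illustration):

```python
# LAMBDA[i][j] = lambda(a_i | w_j): loss for taking action a_i
# when the true class is w_j. Two classes, three actions.
LAMBDA = [
    [0.0, 1.0],    # a_1: decide w_1
    [1.0, 0.0],    # a_2: decide w_2
    [0.25, 0.25],  # a_3: reject (hypothetical fixed loss either way)
]

def conditional_risk(posteriors):
    """R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x), for each action."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in LAMBDA]

def bayes_action(posteriors):
    """Index of the action minimizing the conditional risk."""
    risks = conditional_risk(posteriors)
    return min(range(len(risks)), key=risks.__getitem__)
```

With confident posteriors the rule commits to a class; near 50/50 it prefers the cheap rejection action, which is exactly why rejection is modeled as an action.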
Bayes Decision Rule Using Conditional Risk
1. Compute the conditional risk R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x) for each action.
2. Select the action that has the minimum conditional risk. Let this be action α_k.
3. The overall risk is then R = ∫_x R(α(x) | x) p(x) dx, integrated over all possible values of x, where R(α(x) | x) is the conditional risk associated with taking action α(x) based on the observation x, and p(x) is the probability that x will be observed.
4. This is the Bayes risk, the minimum possible risk that can be achieved by any classifier!
Two-Class Special Case
Definitions: α_1: decide on ω_1; α_2: decide on ω_2; λ_ij = λ(α_i | ω_j): the loss for deciding on ω_i when the state of nature is ω_j.
Conditional risks:
R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)
Note that λ_11 and λ_22 need not be zero, though we expect λ_11 < λ_21 and λ_22 < λ_12.
Decide on ω_1 if R(α_1 | x) < R(α_2 | x); decide on ω_2 otherwise. Equivalently:
Λ(x) = p(x | ω_1) / p(x | ω_2) ≷ [(λ_12 − λ_22) P(ω_2)] / [(λ_21 − λ_11) P(ω_1)]   (ω_1 if >, ω_2 if <)
The likelihood ratio test (LRT): pick ω_1 if the likelihood ratio is greater than a threshold that is independent of x. This rule, which minimizes the Bayes risk, is also called the Bayes criterion.
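The LRT above translates almost directly into code. A minimal sketch (the function name and all numeric inputs are illustrative, not from the slides):

```python
def lrt_decide(p_x_w1, p_x_w2, P_w1, P_w2, lam):
    """Likelihood ratio test for two classes.
    lam is a 2x2 loss matrix with lam[i-1][j-1] = lambda_ij, the loss
    for deciding w_i when the truth is w_j. Returns 1 or 2."""
    ratio = p_x_w1 / p_x_w2                                 # Lambda(x)
    # Threshold (lambda_12 - lambda_22) P(w2) / ((lambda_21 - lambda_11) P(w1)),
    # independent of x.
    threshold = ((lam[0][1] - lam[1][1]) * P_w2) / ((lam[1][0] - lam[0][0]) * P_w1)
    return 1 if ratio > threshold else 2
```

With the zero-one loss and equal priors the threshold reduces to 1, so the test simply picks the class with the larger likelihood.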
Example (figure from R. Gutierrez @ TAMU).
Example (to be fully solved on request on Friday); figure with losses λ_11, λ_12, λ_21, λ_22, modified from R. Gutierrez @ TAMU.
Minimum Error-Rate Classification: Multiclass Case
If we associate taking action α_i with selecting class ω_i, and if all errors are equally costly, we obtain the zero-one loss (symmetric cost function):
λ(α_i | ω_j) = 0 if i = j, 1 if i ≠ j
This loss function assigns no loss to a correct classification and unit loss to any misclassification. The risk corresponding to this loss function is then
R(α_i | x) = Σ_{j≠i} P(ω_j | x) = 1 − P(ω_i | x)
What does this tell us? To minimize this risk (the average probability of error), we choose the class that maximizes the posterior probability: choose ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i, j = 1, …, c. This is the maximum a posteriori (MAP) criterion. In the two-class case it is again a likelihood ratio test:
Λ(x) = p(x | ω_1) / p(x | ω_2) ≷ P(ω_2) / P(ω_1)   (ω_1 if >, ω_2 if <)
which reduces to the maximum likelihood criterion for equal priors.
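The MAP criterion can be sketched in a few lines (an illustrative helper, not from the slides; classes are 0-indexed in code):

```python
def map_decide(likelihoods, priors):
    """MAP (zero-one loss) decision: argmax_i P(x | w_i) P(w_i).
    The evidence P(x) is common to all classes, so it can be dropped."""
    scores = [l * p for l, p in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Note how a strong prior can overrule a moderately larger likelihood, which is the whole point of the MAP criterion versus maximum likelihood.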
Error Probabilities (Bayes Rule Rules!)
In a two-class case, there are two sources of error: x falls in R_1 yet the state of nature is ω_2, or vice versa.
P(error) = ∫ P(error | x) p(x) dx = ∫_{R_1} p(x | ω_2) P(ω_2) dx + ∫_{R_2} p(x | ω_1) P(ω_1) dx
where P(x ∈ R_1, ω_2) = P(x ∈ R_1 | ω_2) P(ω_2) and P(x ∈ R_2, ω_1) = P(x ∈ R_2 | ω_1) P(ω_1).
x_B: the optimal Bayes solution; x*: a non-optimal solution.
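For two 1-D Gaussian classes, the integral above can be evaluated numerically: at each x, the smaller of the two joint densities P(ω_i) p(x | ω_i) is the error mass, since the Bayes rule picks the larger one. This is an illustrative sketch (function names and the integration range are choices, not from the slides):

```python
import math

def gauss(x, mu, sigma):
    """1-D normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_error(mu1, mu2, sigma, P1, P2, lo=-10.0, hi=10.0, n=20000):
    """Numerically integrate P(error) for the Bayes rule applied to two
    1-D Gaussians with common sigma, using the midpoint rule."""
    dx = (hi - lo) / n
    err = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        a = P1 * gauss(x, mu1, sigma)   # joint density of class 1
        b = P2 * gauss(x, mu2, sigma)   # joint density of class 2
        err += min(a, b) * dx           # the losing class contributes the error
    return err
```

For equal priors, means 0 and 2, and unit variance, the boundary sits at x = 1 and the error equals the Gaussian tail beyond one standard deviation, about 0.1587.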
Probability of Error
In the multi-class case there are more ways to be wrong than to be right, so we exploit the fact that P(error) = 1 − P(correct), where
P(correct) = Σ_{i=1}^{C} P(x ∈ R_i, ω_i) = Σ_{i=1}^{C} P(x ∈ R_i | ω_i) P(ω_i) = Σ_{i=1}^{C} ∫_{R_i} p(x | ω_i) P(ω_i) dx = Σ_{i=1}^{C} ∫_{R_i} P(ω_i | x) p(x) dx
Of course, in order to minimize P(error) we need to maximize P(correct), which requires maximizing each and every one of the integrals. Note that p(x) is common to all of the integrals, so the expression is maximized by choosing the decision regions R_i where the posterior probabilities P(ω_i | x) are maximum. (Figure from R. Gutierrez @ TAMU.)
Discriminant-Based Classification
A discriminant is a function g_i(x) that discriminates between classes. It assigns the input vector to a class according to its definition: choose class ω_i if g_i(x) > g_j(x) for all j ≠ i, i, j = 1, 2, …, c.
Bayes rule can be implemented in terms of discriminant functions: g_i(x) = P(ω_i | x).
The discriminant functions generate c decision regions R_1, …, R_c, separated by decision boundaries. Decision regions need NOT be contiguous. The decision boundary satisfies g_i(x) = g_j(x), and x ∈ R_i if g_i(x) > g_j(x) for all j ≠ i, j = 1, 2, …, c.
Discriminant Functions
We may view the classifier as an automated machine that computes c discriminants and selects the category corresponding to the largest discriminant. A neural network is one such classifier.
For a Bayes classifier with non-uniform risks: g_i(x) = −R(α_i | x)
For the MAP classifier (uniform risks): g_i(x) = P(ω_i | x)
For the maximum likelihood classifier (equal priors): g_i(x) = p(x | ω_i)
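The "machine" view above is just an argmax over c functions. A minimal sketch (an illustrative helper, not from the slides):

```python
def classify(x, discriminants):
    """Evaluate all c discriminants g_i(x) and return the index of
    the largest one, i.e. the chosen class."""
    return max(range(len(discriminants)), key=lambda i: discriminants[i](x))
```

Any of the three choices of g_i listed above can be plugged in; the argmax machinery is identical.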
Discriminant Functions
In fact, multiplying every discriminant function by the same positive constant, or adding/subtracting the same constant to all of them, does not change the decision boundary. In general, every g_i(x) can be replaced by f(g_i(x)), where f(·) is any monotonically increasing function, without affecting the actual decision boundary. Some linear or non-linear transformations of the previously stated discriminant functions may greatly simplify the design of the classifier. What examples can you think of?
Normal Densities
If the likelihoods are normally distributed, a number of simplifications can be made. In particular, the discriminant function can be written in this greatly simplified form (!):
p(x | ω_i) ~ N(μ_i, Σ_i):
p(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) ]
g_i(x) = −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i)
There are three distinct cases that can occur:
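The general Gaussian discriminant above is straightforward to evaluate once the inverse covariance and log-determinant are precomputed. A dependency-free sketch using plain lists (the function name and calling convention are choices made here, not from the slides):

```python
import math

def gaussian_discriminant(x, mu, sigma_inv, log_det_sigma, log_prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - (d/2) ln 2pi
                - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(x)
    diff = [xa - mb for xa, mb in zip(x, mu)]
    # Quadratic form diff^T Sigma^{-1} diff
    q = sum(diff[i] * sigma_inv[i][j] * diff[j]
            for i in range(d) for j in range(d))
    return (-0.5 * q - 0.5 * d * math.log(2 * math.pi)
            - 0.5 * log_det_sigma + log_prior)
```

At x = μ_i with identity covariance in 2-D and log-prior 0, the quadratic term vanishes and g reduces to −ln 2π, which is a handy sanity check.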
Case 1: Σ_i = σ²I
Features are statistically independent, and all features have the same variance. The distributions are spherical in d dimensions, the boundary is a generalized hyperplane (linear discriminant) of d − 1 dimensions, and the features form equal-sized hyperspherical clusters. (Examples of such hyperspherical clusters are shown in the figure.) The general form of the discriminant is then
g_i(x) = −‖x − μ_i‖² / (2σ²) + ln P(ω_i)
If the priors are the same:
g_i(x) = −(x − μ_i)^T (x − μ_i) / (2σ²)  →  the minimum distance classifier.
Case 1: Σ_i = σ²I (continued)
This case results in linear discriminants that can be written in the form
g_i(x) = w_i^T x + w_i0
with w_i = μ_i / σ² and the threshold (bias) of the i-th category
w_i0 = −μ_i^T μ_i / (2σ²) + ln P(ω_i)
(Figures: 1-D, 2-D and 3-D cases.) Note how the priors shift the discriminant function away from the more likely mean!
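The Case 1 weight formulas can be checked directly in code. In this sketch (helper names and the numbers in the test are illustrative), two classes with equal priors and unit variance should meet exactly halfway between their means:

```python
import math

def case1_weights(mu, sigma2, prior):
    """Case 1 (Sigma_i = sigma^2 I):
    w_i = mu_i / sigma^2,  w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i)."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2 * sigma2) + math.log(prior)
    return w, w0

def linear_g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0
```

With means (0, 0) and (2, 0), the two discriminants are equal on the hyperplane x_1 = 1, confirming the midpoint boundary for equal priors.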
Case 2: Σ_i = Σ
Covariance matrices are arbitrary, but equal to each other for all classes. The features then form hyperellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes:
g_i(x) = −(1/2) (x − μ_i)^T Σ^{−1} (x − μ_i) + ln P(ω_i)
which can be written as g_i(x) = w_i^T x + w_i0 with
w_i = Σ^{−1} μ_i,  w_i0 = −(1/2) μ_i^T Σ^{−1} μ_i + ln P(ω_i)
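The Case 2 weights are the same linear form with Σ^{−1} in place of 1/σ². A dependency-free sketch (the helper name and the identity-covariance test values are illustrative):

```python
import math

def case2_weights(mu, sigma_inv, prior):
    """Case 2 (shared Sigma):
    w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i)."""
    d = len(mu)
    # Matrix-vector product Sigma^{-1} mu
    w = [sum(sigma_inv[i][j] * mu[j] for j in range(d)) for i in range(d)]
    # mu^T Sigma^{-1} mu = mu . w
    w0 = -0.5 * sum(mu[i] * w[i] for i in range(d)) + math.log(prior)
    return w, w0
```

When Σ is the identity, this collapses to the Case 1 formulas with σ² = 1, which gives an easy consistency check.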
Case 3: Σ_i = arbitrary
All bets are off! In the two-class case, the decision boundaries form hyperquadrics. The discriminant functions are now, in general, quadratic (not linear), and the decision regions may be non-contiguous:
g_i(x) = x^T W_i x + w_i^T x + w_i0
W_i = −(1/2) Σ_i^{−1},  w_i = Σ_i^{−1} μ_i,  w_i0 = −(1/2) μ_i^T Σ_i^{−1} μ_i − (1/2) ln |Σ_i| + ln P(ω_i)
(Figures: hyperbolic, parabolic, linear, ellipsoidal and circular boundaries.)
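The Case 3 quadratic discriminant, evaluated per class from that class's own Σ_i, can be sketched as follows (the function name and the test values are illustrative, not from the slides):

```python
import math

def case3_g(x, mu, sigma_inv, log_det, prior):
    """Case 3 (arbitrary Sigma_i): g_i(x) = x^T W_i x + w_i^T x + w_i0 with
    W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
    w_i0 = -1/2 mu_i^T Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln P(w_i)."""
    d = len(x)
    Si_mu = [sum(sigma_inv[i][j] * mu[j] for j in range(d)) for i in range(d)]
    quad = -0.5 * sum(x[i] * sigma_inv[i][j] * x[j]
                      for i in range(d) for j in range(d))   # x^T W_i x
    lin = sum(Si_mu[i] * x[i] for i in range(d))             # w_i^T x
    w0 = (-0.5 * sum(mu[i] * Si_mu[i] for i in range(d))
          - 0.5 * log_det + math.log(prior))
    return quad + lin + w0
```

At x = μ_i the quadratic and linear terms cancel against the −(1/2) μ^T Σ^{−1} μ constant, leaving −(1/2) ln|Σ_i| + ln P(ω_i), which makes a convenient sanity check.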
Case 3: Σ_i = arbitrary
For the multi-class case, the boundaries look even more complicated. As an example: (figure: decision boundaries).
Case 3: Σ_i = arbitrary, in 3-D (figure).
Conclusions
The Bayes classifier for normally distributed classes is, in general, a quadratic classifier, and it can be computed in closed form.
The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier.
For normally distributed classes with equal covariance matrices and equal priors, it is a minimum Mahalanobis distance classifier.
For normally distributed classes with covariance matrices proportional to the identity matrix and with equal priors, it is a minimum Euclidean distance classifier.
Note that using a minimum Euclidean or Mahalanobis distance classifier implicitly makes certain assumptions about the statistical properties of the data, which may or may not (and in general are not) true. However, in many cases certain simplifications and approximations can be made that warrant making such assumptions even if they are not strictly true. The bottom line in practice in deciding whether the assumptions are warranted is: does the damn thing solve my classification problem?
Error Bounds
It is difficult at best, if possible at all, to compute the error probabilities analytically, particularly when the decision regions are not contiguous. However, upper bounds for this error can be obtained: the Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute; often, even non-Gaussian cases are treated as Gaussian.
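For the 1-D Gaussian case, the Bhattacharyya bound is a one-liner. This sketch uses the standard form P(error) ≤ √(P_1 P_2) exp(−k(1/2)), where k(1/2) is the Bhattacharyya distance between the two Gaussians (the function name and the numeric check are illustrative choices):

```python
import math

def bhattacharyya_bound(mu1, s1, mu2, s2, P1, P2):
    """Bhattacharyya upper bound on P(error) for two 1-D Gaussians
    N(mu_i, s_i^2) with priors P1, P2:
    k(1/2) = (mu2-mu1)^2 / (4 (s1^2 + s2^2)) + 1/2 ln(avg_var / (s1 s2))."""
    avg_var = 0.5 * (s1 ** 2 + s2 ** 2)
    k = (mu2 - mu1) ** 2 / (8 * avg_var) + 0.5 * math.log(avg_var / (s1 * s2))
    return math.sqrt(P1 * P2) * math.exp(-k)
```

For equal priors, unit variances, and means 0 and 2, the bound is 0.5·e^{−0.5} ≈ 0.303, comfortably above the true Bayes error of about 0.159, as an upper bound should be.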