LECTURE 2: Linear and quadratic classifiers


LECTURE 2: Linear and quadratic classifiers

Part 1: Bayesian Decision Theory
- The Likelihood Ratio Test
- Maximum A Posteriori and Maximum Likelihood
- Discriminant functions

Part 2: Quadratic classifiers
- Bayes classifiers for normally distributed classes
- Euclidean and Mahalanobis distance classifiers
- Numerical example

Part 3: Linear classifiers
- Gradient descent
- The perceptron rule
- The pseudo-inverse solution
- Least mean squares

Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University

Part 1: Bayesian Decision Theory

The Likelihood Ratio Test (1)

Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x. Would you agree that a reasonable decision rule would be the following?

"Choose the class that is most probable given the observed feature vector x."

More formally: evaluate the posterior probability of each class, $P(\omega_i|x)$, and choose the class with the largest $P(\omega_i|x)$.

The Likelihood Ratio Test (2)

Let us examine this decision rule for a two-class problem. In this case the decision rule becomes

    if $P(\omega_1|x) > P(\omega_2|x)$ choose $\omega_1$, else choose $\omega_2$

or, in more compact form,

    $P(\omega_1|x) \gtrless^{\omega_1}_{\omega_2} P(\omega_2|x)$

Applying Bayes' theorem,

    $\frac{P(x|\omega_1)P(\omega_1)}{P(x)} \gtrless^{\omega_1}_{\omega_2} \frac{P(x|\omega_2)P(\omega_2)}{P(x)}$

The Likelihood Ratio Test (3)

$P(x)$ does not affect the decision rule, so it can be eliminated. Rearranging the previous expression,

    $\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{P(\omega_2)}{P(\omega_1)}$

The term $\Lambda(x)$ is called the likelihood ratio, and the decision rule is known as the likelihood ratio test.

Likelihood Ratio Test: an example (1)

Given a classification problem with the following class-conditional densities:

    $P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{2}}$    $P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-10)^2}{2}}$

[Figure: the two likelihoods $P(x|\omega_1)$ and $P(x|\omega_2)$, centered at x = 4 and x = 10.]

Derive a classification rule based on the Likelihood Ratio Test (assume equal priors).

Likelihood Ratio Test: an example (2)

Solution. Substituting the given likelihoods and priors into the LRT expression:

    $\Lambda(x) = \frac{e^{-\frac{(x-4)^2}{2}}}{e^{-\frac{(x-10)^2}{2}}} \gtrless^{\omega_1}_{\omega_2} 1$

Simplifying, changing signs, and taking logs:

    $(x-4)^2 - (x-10)^2 \lessgtr^{\omega_1}_{\omega_2} 0$

which yields

    $x \lessgtr^{\omega_1}_{\omega_2} 7$

This LRT result makes intuitive sense, since the two likelihoods are identical and differ only in their mean value.

[Figure: the two likelihoods, with decision regions $R_1$ (say $\omega_1$) for x < 7 and $R_2$ (say $\omega_2$) for x > 7.]
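As a quick sanity check, the rule can be evaluated numerically. The sketch below is our own illustration (not part of the original slides; the function name gaussian_pdf is ours): it computes the likelihood ratio on a grid of x values and confirms that the decision flips at x = 7.

    import numpy as np

    def gaussian_pdf(x, mean, var=1.0):
        """Univariate Gaussian density N(mean, var)."""
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    x = np.linspace(0.0, 14.0, 1401)
    lam = gaussian_pdf(x, 4.0) / gaussian_pdf(x, 10.0)  # likelihood ratio
    choose_w1 = lam > 1.0                               # equal priors: threshold 1

    # the decision should flip exactly at the midpoint of the two means
    print(x[np.argmin(choose_w1)])                      # -> 7.0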

The probability of error

The probability of error is the probability of assigning x to the wrong class. For a two-class problem, $P[\text{error}|x]$ is simply

    $P(\text{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$

It makes sense that the classification rule be designed to minimize the average probability of error $P[\text{error}]$ across all possible values of x:

    $P(\text{error}) = \int_{-\infty}^{+\infty} P(\text{error}, x)\,dx = \int_{-\infty}^{+\infty} P(\text{error}|x)\,P(x)\,dx$

To minimize $P(\text{error})$ we minimize the integrand $P(\text{error}|x)$ at each x: choose the class with the maximum posterior $P(\omega_i|x)$. This is called the MAXIMUM A POSTERIORI (MAP) rule.

Minimizing the probability of error

We prove the optimality of the MAP rule graphically:
- The right plot shows the posterior for each of the two classes.
- The bottom plots show $P(\text{error})$ for the MAP rule and for an alternative decision rule. Which one has lower $P(\text{error})$ (color-filled area)?

[Figure: the two posteriors $P(\omega|x)$; below, the decision regions of THE MAP RULE (Choose RED / Choose BLUE / Choose RED) and THE OTHER RULE, each with its shaded error area.]

The Bayes Risk (1)

So far we have assumed that the penalty of misclassifying $\omega_1$ as $\omega_2$ is the same as the reciprocal. In general, this is not the case:
- Misclassifications in the fish-sorting problem lead to different costs.
- Medical diagnostic errors are very asymmetric.

We capture this concept with a cost function $C_{ij}$, where $C_{ij}$ represents the cost of choosing class $\omega_i$ when class $\omega_j$ is the true class, and define the Bayes Risk as the expected value of the cost:

    $\mathfrak{R} = E[C] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[\text{choose } \omega_i \text{ and } x \in \omega_j] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[x \in R_i | \omega_j]\, P[\omega_j]$

The Bayes Risk (2)

What is the decision rule that minimizes the Bayes Risk? It can be shown* that the minimum risk is achieved by the following decision rule:

    $\frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12} - C_{22})\, P[\omega_2]}{(C_{21} - C_{11})\, P[\omega_1]}$

(*For an intuitive proof, visit my lecture notes at TAMU.)

Notice any similarities with the LRT?

The Bayes Risk: an example (1)

Consider a classification problem with two classes defined by the following likelihood functions:

    $P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-2)^2}{2}}$    $P(x|\omega_2) = \frac{1}{\sqrt{2\pi\cdot 3}}\, e^{-\frac{x^2}{2\cdot 3}}$

[Figure: the two likelihoods plotted over the range -6 < x < 6.]

What is the decision rule that minimizes the Bayes risk? Assume $P[\omega_1] = P[\omega_2] = 0.5$, $C_{11} = C_{22} = 0$, $C_{12} = 1$ and $C_{21} = 3^{-1/2}$.

The Bayes Risk: an example (2)

    $\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-2)^2}{2}}}{\frac{1}{\sqrt{2\pi\cdot 3}}\, e^{-\frac{x^2}{6}}} = \sqrt{3}\, e^{\frac{x^2}{6} - \frac{(x-2)^2}{2}} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]} = \sqrt{3}$

Taking logs and simplifying:

    $\frac{x^2}{6} - \frac{(x-2)^2}{2} \gtrless^{\omega_1}_{\omega_2} 0 \;\Rightarrow\; -2x^2 + 12x - 12 \gtrless^{\omega_1}_{\omega_2} 0 \;\Rightarrow\; x^2 - 6x + 6 \lessgtr^{\omega_1}_{\omega_2} 0$

which yields the boundaries $x = 3 \pm \sqrt{3} = 4.73,\ 1.27$: choose $\omega_1$ for $1.27 < x < 4.73$, and $\omega_2$ otherwise.

[Figure: the two likelihoods with the resulting decision regions $R_2$ (x < 1.27), $R_1$ (1.27 < x < 4.73), $R_2$ (x > 4.73).]
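The boundary values can be checked numerically: at $x = 3 \pm \sqrt{3}$ the likelihood ratio should equal the threshold $\sqrt{3}$. A minimal sketch (our own, not from the slides), assuming the densities and costs reconstructed above:

    import numpy as np

    def pdf(x, mean, var):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    threshold = np.sqrt(3.0)   # (C12-C22) P[w2] / ((C21-C11) P[w1])
    for x in (3 - np.sqrt(3), 3 + np.sqrt(3)):       # 1.27 and 4.73
        lam = pdf(x, 2.0, 1.0) / pdf(x, 0.0, 3.0)    # likelihood ratio
        print(x, lam, np.isclose(lam, threshold))    # True at both boundaries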

Variations of the LRT (1)

The LRT that minimizes the Bayes Risk is called the Bayes Criterion:

    $\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]}$

Maximum A Posteriori Criterion: sometimes we will be interested in minimizing $P[\text{error}]$, which is a special case of the Bayes Criterion if we use a zero-one cost function:

    $C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{P(\omega_2)}{P(\omega_1)} \;\Leftrightarrow\; P(\omega_1|x) \gtrless^{\omega_1}_{\omega_2} P(\omega_2|x)$

Variations of the LRT (2)

Maximum Likelihood: finally, the simplest form of the LRT is obtained for the case of equal priors $P[\omega_i] = 1/2$ and a zero-one cost function:

    $P(\omega_i) = \frac{1}{2},\quad C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} 1$

When would you want to use an ML criterion?

Multi-class problems

The previous decision rules were derived for two-class problems, but they generalize gracefully to multiple classes:
- To minimize $P[\text{error}]$, choose the class with the highest posterior $P[\omega_i|x]$: $\omega_i = \arg\max_{i=1..C} P(\omega_i|x)$
- To minimize the Bayes risk, choose the class with the lowest conditional risk $R[\omega_i|x]$: $\omega_i = \arg\min_{i=1..C} R(\omega_i|x) = \arg\min_{i=1..C} \sum_{j=1}^{C} C_{ij}\, P(\omega_j|x)$

Discriminant functions (1)

All these decision rules have the same structure: at each point x in feature space, choose the class $\omega_i$ which maximizes (or minimizes) some measure $g_i(x)$.
- This structure can be formalized with a set of discriminant functions $g_i(x)$, $i = 1..C$, and the decision rule "assign x to class $\omega_i$ if $g_i(x) > g_j(x)\ \forall j \neq i$".
- We can then express the three basic decision rules (Bayes, MAP and ML) in terms of discriminant functions:

    Criterion | Discriminant function
    Bayes     | $g_i(x) = -R(\alpha_i|x)$
    MAP       | $g_i(x) = P(\omega_i|x)$
    ML        | $g_i(x) = P(x|\omega_i)$

Discriminant functions (2)

Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class corresponding to the largest discriminant.

[Figure: a network with features $x_1, x_2, x_3, \ldots, x_d$ at the bottom, discriminant functions $g_1(x), g_2(x), \ldots, g_C(x)$ above them (optionally combined with costs), a "select max" stage, and the class assignment at the top.]

Recapping

- The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods $P[x|\omega_i]$.
- $P[x|\omega_i]$ is generally unknown, but can be estimated from data.
- If the form of the likelihood is known (e.g., Gaussian), the problem is simplified because we only need to estimate the parameters of the model (e.g., mean and covariance). This leads to a classifier known as QUADRATIC, which we cover next.
- If the form of the likelihood is unknown, the problem becomes much harder and requires a technique known as non-parametric density estimation. This technique is covered in Lecture 3.

Part 2: Quadratic classifiers

Bayes classifier for Gaussian classes (1)

For normally distributed classes, the discriminant functions reduce to very simple expressions. The (multivariate) Gaussian density can be defined as

    $p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$

Using Bayes' rule, the MAP discriminant function can be written as

    $g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)} = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] \frac{P(\omega_i)}{P(x)}$

Bayes classifier for Gaussian classes (2)

Eliminating constant terms:

    $g_i(x) = |\Sigma_i|^{-1/2} \exp\!\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] P(\omega_i)$

Taking logs:

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\Sigma_i|\right) + \log\!\left(P(\omega_i)\right)$

This is known as a QUADRATIC discriminant function (because it is a function of the square of x). In the next few slides we will analyze what happens to this expression under different assumptions about the covariance.
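The quadratic discriminant above translates almost line for line into code. Below is a minimal sketch (our own illustration; the function name quadratic_discriminant is not from the slides) that evaluates $g_i(x)$ for one class:

    import numpy as np

    def quadratic_discriminant(x, mu, sigma, prior):
        """g_i(x) = -1/2 (x-mu)' inv(Sigma) (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
        d = x - mu
        return (-0.5 * d @ np.linalg.solve(sigma, d)
                - 0.5 * np.log(np.linalg.det(sigma))
                + np.log(prior))

    # Classification: evaluate g_i for every class and pick the argmax, e.g.
    # label = np.argmax([quadratic_discriminant(x, mu_i, S_i, p_i)
    #                    for mu_i, S_i, p_i in classes])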

Case 1: $\Sigma_i = \sigma^2 I$ (1)

This situation occurs when the features are statistically independent and have the same variance for all classes. In this case, the quadratic discriminant function becomes

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\sigma^2 I|\right) + \log\!\left(P(\omega_i)\right) = -\frac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log\!\left(P(\omega_i)\right)$

Assuming equal priors and dropping constant terms:

    $g_i(x) = -(x-\mu_i)^T (x-\mu_i) = -\sum_{k=1}^{DIM} (x_k - \mu_{i,k})^2$

This is called a Euclidean-distance or nearest-mean classifier. From [Schalkoff, 1992]

Case 1: $\Sigma_i = \sigma^2 I$ (2)

This is probably the simplest statistical classifier that you can build: assign an unknown example to the class whose center is the closest, using the Euclidean distance.

[Figure: x is fed to C Euclidean-distance units, one per mean $\mu_1 \ldots \mu_C$; a minimum selector outputs the class.]
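A nearest-mean classifier is only a few lines of code. The sketch below is our own illustration (names are ours): it assigns each sample to the class with the closest mean in Euclidean distance.

    import numpy as np

    def nearest_mean_classify(X, means):
        """X: (N, d) samples; means: (C, d) class centers.
        Returns the index of the closest mean for each sample."""
        # squared Euclidean distance of every sample to every mean: (N, C)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    means = np.array([[0.0, 0.0], [4.0, 4.0]])
    X = np.array([[0.5, 1.0], [3.0, 3.5]])
    print(nearest_mean_classify(X, means))   # -> [0 1]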

Case 1: $\Sigma_i = \sigma^2 I$, example

[Figure: a three-class 2D example with equal spherical covariances $\Sigma_1 = \Sigma_2 = \Sigma_3 \propto I$; the Gaussian contours are circles, and the decision boundaries are straight lines, the perpendicular bisectors of the segments joining the class means $\mu_1$, $\mu_2$, $\mu_3$.]

Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)

All the classes have the same covariance matrix, but the matrix is not diagonal. In this case, the quadratic discriminant becomes

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\Sigma|\right) + \log\!\left(P(\omega_i)\right)$

Assuming equal priors and eliminating constant terms:

    $g_i(x) = -(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$

This is known as a Mahalanobis-distance classifier.

[Figure: x is fed to C Mahalanobis-distance units (all sharing $\Sigma$), one per mean $\mu_1 \ldots \mu_C$; a minimum selector outputs the class.]

The Mahalanobis distance

The quadratic term is called the Mahalanobis distance, a very important metric in statistical pattern recognition. The Mahalanobis metric is a vector distance that uses a $\Sigma^{-1}$ norm; $\Sigma^{-1}$ can be thought of as a stretching factor on the space. Note that for an identity covariance matrix ($\Sigma = I$), the Mahalanobis distance becomes the familiar Euclidean distance.

[Figure: a mean $\mu$ with two iso-distance contours: a circle of constant Euclidean distance $\|x-\mu\| = K$ and an ellipse of constant Mahalanobis distance $\|x-\mu\|_{\Sigma^{-1}} = K$.]
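A sketch of the Mahalanobis distance and the corresponding minimum-distance classifier, assuming a shared covariance $\Sigma$ (our own illustration; names and numbers are ours):

    import numpy as np

    def mahalanobis2(x, mu, sigma):
        """Squared Mahalanobis distance (x-mu)' inv(Sigma) (x-mu)."""
        d = x - mu
        return float(d @ np.linalg.solve(sigma, d))

    sigma = np.array([[1.0, 0.7], [0.7, 1.0]])   # shared, non-diagonal covariance
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]

    x = np.array([1.0, 2.0])
    dists = [mahalanobis2(x, mu, sigma) for mu in means]
    print(int(np.argmin(dists)))   # class with the smallest Mahalanobis distance

    # sanity check: with Sigma = I it reduces to the squared Euclidean distance
    assert np.isclose(mahalanobis2(x, means[0], np.eye(2)), (x ** 2).sum())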

Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal), example

[Figure: a three-class 2D example with a shared non-diagonal covariance (off-diagonal terms 0.7); the Gaussian contours are equally oriented ellipses, and the decision boundaries are still straight lines.]

Case 3: $\Sigma_i \neq \Sigma_j$, general case, example

[Figure: a three-class 2D example in which each class has a different covariance matrix; the decision boundaries are quadratic. A zoomed-out view shows how the quadratic boundaries behave far from the data.]

Numerical example (1)

Derive a linear discriminant function for the two-class 3D classification problem defined by

    $\mu_1 = [0\ \ 0\ \ 0]^T,\quad \mu_2 = [1\ \ 1\ \ 1]^T,\quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 1/4 & 0 \\ 0 & 0 & 1/4 \end{bmatrix},\quad p(\omega_2) = 2\,p(\omega_1)$

Would anybody dare to sketch the likelihood densities and the decision boundary for this problem?

Numerical example (2)

Solution. Starting from the quadratic discriminant

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$

and noting that the $-\frac{1}{2}\log|\Sigma_i|$ terms are equal for both classes and cancel:

    $g_1(x) = -\frac{1}{2}\begin{bmatrix} x & y & z \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \log\frac{1}{3}$

    $g_2(x) = -\frac{1}{2}\begin{bmatrix} x-1 & y-1 & z-1 \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x-1 \\ y-1 \\ z-1 \end{bmatrix} + \log\frac{2}{3}$

Numerical example (3)

Solution (continued):

    $g_1(x) > g_2(x) \;\Leftrightarrow\; -2\left(x^2+y^2+z^2\right) + \log\frac{1}{3} > -2\left[(x-1)^2+(y-1)^2+(z-1)^2\right] + \log\frac{2}{3}$

    $\Leftrightarrow\; x + y + z \lessgtr^{\omega_1}_{\omega_2} \frac{6 - \log 2}{4} \approx 1.33$

Classify the test example $x_u = [0.1\ \ 0.7\ \ 0.8]^T$:

    $0.1 + 0.7 + 0.8 = 1.6 > 1.33 \;\Rightarrow\; x_u \in \omega_2$
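The worked example can also be verified directly from the quadratic discriminant, without the algebraic simplification. A minimal check (our own sketch, using the priors $P(\omega_1) = 1/3$, $P(\omega_2) = 2/3$ reconstructed above):

    import numpy as np

    mu = [np.zeros(3), np.ones(3)]
    sigma = np.eye(3) / 4.0
    priors = [1.0 / 3.0, 2.0 / 3.0]

    def g(x, mu_i, prior):
        d = x - mu_i
        return (-0.5 * d @ np.linalg.solve(sigma, d)
                - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

    x_u = np.array([0.1, 0.7, 0.8])
    print(g(x_u, mu[0], priors[0]), g(x_u, mu[1], priors[1]))  # g2 > g1 -> w2
    print((6 - np.log(2)) / 4)   # threshold on x+y+z, ~1.33 (the sum is 1.6)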

Conclusions

- The Euclidean-distance classifier is Bayes-optimal* for Gaussian classes with equal covariance matrices proportional to the identity matrix and equal priors.
- The Mahalanobis-distance classifier is Bayes-optimal for Gaussian classes with equal covariance matrices and equal priors.

*Bayes-optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve.

Part 3: Linear classifiers

Linear Discriminant Functions (1)

The objective of this section is to present methods for learning linear discriminant functions of the form

    $g(x) = w^T x + w_0; \qquad g(x) > 0 \Rightarrow x \in \omega_1, \quad g(x) < 0 \Rightarrow x \in \omega_2$

where w is the weight vector and $w_0$ is the threshold or bias.

[Figure: a separating hyperplane $w^T x + w_0 = 0$ in feature space, with $w^T x + w_0 > 0$ on one side and $w^T x + w_0 < 0$ on the other; w is normal to the hyperplane.]

Similar discriminant functions were derived in the previous section as a special case of the quadratic classifier. In this chapter, the discriminant functions will be derived in a non-parametric fashion; that is, no assumptions will be made about the underlying densities.

Linear Discriminant Functions (2)

For convenience, we will focus on binary classification. Extension to the multi-category case can be easily achieved by:
- Using $\omega_i$ / not-$\omega_i$ dichotomies
- Using $\omega_i$ / $\omega_j$ dichotomies

Gradient descent (1)

Gradient descent is a general method for function minimization. From basic calculus, we know that the minimum of a function J(x) is defined by the zeros of its gradient:

    $\nabla_x J(x) = 0 \;\Rightarrow\; x^* = \arg\min_x J(x)$

- Only in very special cases does this minimization problem have a closed-form solution.
- In some other cases, a closed-form solution may exist, but it is numerically ill-posed or impractical (e.g., memory requirements).

Gradient descent (2)

Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent:

1. Start with an arbitrary solution x(0).
2. Compute the gradient $\nabla_x J(x(k))$.
3. Move in the direction of steepest descent: $x(k+1) = x(k) - \eta\, \nabla_x J(x(k))$, where $\eta$ is a learning rate.
4. Go to 2 (until convergence).

[Figure: a one-dimensional J with a local and a global minimum; from the initial guess, where the slope is negative the update moves right, where the slope is positive it moves left.]
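A minimal sketch of this procedure (our own illustration; the function and its gradient are a toy example we chose, not from the slides):

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, n_iters=100):
        """Iterate x(k+1) = x(k) - eta * grad(x(k))."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iters):
            x = x - eta * grad(x)
        return x

    # Example: J(x) = ||x - c||^2 has gradient 2(x - c) and minimum at c
    c = np.array([3.0, -1.0])
    x_star = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0])
    print(x_star)   # -> approximately [3, -1]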

Perceptron learning (1)

Let us now consider the problem of solving a binary classification problem with a linear discriminant. As usual, assume we have a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ containing examples from the two classes. For convenience, we will absorb the intercept $w_0$ by augmenting the feature vector x with an additional constant dimension:

    $w^T x + w_0 = \begin{bmatrix} w_0 & w^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = a^T y$

From [Duda, Hart and Stork, 2001]

Perceptron learning (2)

Keep in mind that our objective is to find a vector a such that

    $g(x) = a^T y \begin{cases} > 0 & x \in \omega_1 \\ < 0 & x \in \omega_2 \end{cases}$

To simplify the derivation, we will "normalize" the training set by replacing all examples from class $\omega_2$ by their negatives:

    $y \leftarrow -y \quad \forall y \in \omega_2$

This allows us to ignore class labels and look for a weight vector such that

    $a^T y > 0 \quad \forall y$

From [Duda, Hart and Stork, 2001]

Perceptron learning (3)

To find this solution we must first define an objective function J(a). A good choice is what is known as the Perceptron criterion:

    $J_P(a) = \sum_{y \in Y_M} \left(-a^T y\right)$

where $Y_M$ is the set of examples misclassified by a. Note that $J_P(a)$ is non-negative, since $a^T y < 0$ for misclassified samples.

Perceptron learning (4)

To find the minimum of $J_P(a)$ we use gradient descent. The gradient is defined by

    $\nabla_a J_P(a) = \sum_{y \in Y_M} (-y)$

and the gradient descent update rule becomes

    $a(k+1) = a(k) + \eta \sum_{y \in Y_M} y$

This is known as the perceptron batch update rule. The weight vector may also be updated in an on-line fashion, that is, after the presentation of each individual example:

    $a(k+1) = a(k) + \eta\, y^{(i)}$    [Perceptron rule]

where $y^{(i)}$ is an example that has been misclassified by a(k). From [Duda, Hart and Stork, 2001]
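A compact sketch of the on-line perceptron rule (our own illustration; variable names are ours). It assumes the $\omega_2$ samples have already been negated, so a sample y is misclassified whenever $a^T y \le 0$:

    import numpy as np

    def perceptron(Y, a0, eta=1.0, max_epochs=100):
        """On-line perceptron rule: a <- a + eta * y for each misclassified y.
        Y: (N, D+1) augmented, 'normalized' samples (class-2 rows negated)."""
        a = np.asarray(a0, dtype=float)
        for _ in range(max_epochs):
            errors = 0
            for y in Y:
                if a @ y <= 0:          # misclassified (we require a'y > 0)
                    a = a + eta * y     # perceptron update
                    errors += 1
            if errors == 0:             # converged: every sample satisfies a'y > 0
                break
        return a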

Perceptron learning (5)

If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution. However, if the two classes are not linearly separable, the perceptron rule will not converge: since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections of the perceptron rule will never cease. One ad-hoc solution to this problem is to enforce convergence by using a variable learning rate $\eta(k)$ that approaches zero as k approaches infinity.

Perceptron learning example

Consider the following classification problem:
- Class $\omega_1$ defined by the feature vectors x ∈ {[0, 0]ᵀ, [0, 1]ᵀ}
- Class $\omega_2$ defined by the feature vectors x ∈ {[1, 0]ᵀ, [1, 1]ᵀ}

Apply the perceptron algorithm to build a vector a that separates both classes:
- Use learning rate η = 1 and a(0) = [1, −1, −1]ᵀ.
- Update the vector a on a per-example basis.
- Present the examples in the order in which they were given above.

Draw a scatterplot of the data and the separating line you found with the perceptron rule. (A worked sketch follows below.)
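Following the slide's instructions: augment each x with a leading 1, negate the $\omega_2$ samples, and present them in order. A self-contained sketch under the stated assumptions (η = 1 and a(0) = [1, −1, −1]ᵀ, the values we reconstructed from the garbled slide):

    import numpy as np

    # augmented samples y = [1, x1, x2]; class w2 rows are negated
    Y = np.array([[ 1,  0,  0],    # w1: [0, 0]
                  [ 1,  0,  1],    # w1: [0, 1]
                  [-1, -1,  0],    # w2: [1, 0], negated
                  [-1, -1, -1]])   # w2: [1, 1], negated

    a = np.array([1.0, -1.0, -1.0])
    for _ in range(10):                 # a few passes over the data
        for y in Y:
            if a @ y <= 0:              # misclassified
                a = a + y               # eta = 1
    print(a)   # -> [1, -2, 0]: boundary 1 - 2*x1 = 0, i.e. the line x1 = 0.5

The resulting separating line x1 = 0.5 sits halfway between the two columns of points, as the scatterplot would show.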

Minimum Squared Error solution (1)

The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule:
- The perceptron rule seeks a weight vector a that satisfies the inequality $a^T y^{(i)} > 0$. The perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality.
- Instead, the MSE criterion looks for a solution to the equality $a^T y^{(i)} = b^{(i)}$, where the $b^{(i)}$ are some pre-specified target values (e.g., class labels). As a result, the MSE solution uses ALL of the samples in the training set.

From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (2)

The system of equations solved by MSE is

    $\begin{bmatrix} y_0^{(1)} & y_1^{(1)} & \cdots & y_D^{(1)} \\ y_0^{(2)} & y_1^{(2)} & \cdots & y_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ y_0^{(N)} & y_1^{(N)} & \cdots & y_D^{(N)} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_D \end{bmatrix} = \begin{bmatrix} b^{(1)} \\ b^{(2)} \\ \vdots \\ b^{(N)} \end{bmatrix} \;\Leftrightarrow\; Ya = b$

where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding class label. For consistency, we will continue to assume that examples from class $\omega_2$ have been replaced by their negative vectors, although this is not a requirement for the MSE solution. From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (3)

An exact solution to Ya = b can sometimes be found:
- If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by $a = Y^{-1} b$.
- In practice, however, Y will be singular, so its inverse $Y^{-1}$ does not exist: Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found.

Minimum Squared Error solution (4)

The solution in this case is to find a weight vector that minimizes some function of the error between the model ($a^T y$) and the desired output (b). In particular, MSE seeks to Minimize the sum of the Squares of these Errors:

    $J_{MSE}(a) = \sum_{i=1}^{N} \left(a^T y^{(i)} - b^{(i)}\right)^2 = \|Ya - b\|^2$

whose minimum, as usual, can be found by setting the gradient to zero.

The pseudo-inverse solution

The gradient of the objective function is

    $\nabla_a J_{MSE}(a) = 2\sum_{i=1}^{N} \left(a^T y^{(i)} - b^{(i)}\right) y^{(i)} = 2Y^T (Ya - b) = 0$

with zeros defined by $Y^T Y a = Y^T b$. Notice that $Y^T Y$ is now a square matrix! If $Y^T Y$ is nonsingular, the MSE solution becomes

    $a = \left(Y^T Y\right)^{-1} Y^T b = Y^\dagger b$    [Pseudo-inverse solution]

where the matrix $Y^\dagger = (Y^T Y)^{-1} Y^T$ is known as the pseudo-inverse of Y ($Y^\dagger Y = I$). Note that, in general, $Y Y^\dagger \neq I$.
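In code, the pseudo-inverse solution is essentially one line. A sketch reusing the toy data from the perceptron example (our own illustration; np.linalg.pinv computes $Y^\dagger$ via the SVD, which is more robust than forming $(Y^T Y)^{-1}$ explicitly):

    import numpy as np

    # same toy data as in the perceptron example: augmented and normalized
    Y = np.array([[ 1.,  0.,  0.],
                  [ 1.,  0.,  1.],
                  [-1., -1.,  0.],
                  [-1., -1., -1.]])
    b = np.ones(4)               # target margins b(i) = 1

    a = np.linalg.pinv(Y) @ b    # a = (Y'Y)^-1 Y' b when Y'Y is nonsingular
    print(a)
    print(Y @ a)                 # least-squares fit of Ya to b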

Least-mean-squares solution (1)

The objective function $J_{MSE}(a) = \|Ya - b\|^2$ can also be minimized using a gradient descent procedure:
- This avoids the problems that arise when $Y^T Y$ is singular.
- In addition, it also avoids the need for working with large matrices.

Looking at the expression of the gradient, the obvious update rule is

    $a(k+1) = a(k) + \eta(k)\, Y^T \left(b - Ya(k)\right)$

It can be shown that if $\eta(k) = \eta(1)/k$, where $\eta(1)$ is any positive constant, this rule generates a sequence of vectors that converges to a solution of $Y^T(Ya - b) = 0$. From [Duda, Hart and Stork, 2001]

Least-mean-squares solution (2)

The storage requirements of this algorithm can be reduced by considering each sample sequentially:

    $a(k+1) = a(k) + \eta(k)\left(b^{(i)} - y^{(i)T} a(k)\right) y^{(i)}$    [LMS rule]

This is known as the Widrow-Hoff, least-mean-squares (LMS), or delta rule [Mitchell, 1997]. From [Duda, Hart and Stork, 2001]
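And the sequential LMS rule in code, with the decaying learning rate $\eta(k) = \eta(1)/k$ mentioned on the previous slide (a sketch; the function name and the constant $\eta(1) = 0.5$ are our own choices):

    import numpy as np

    def lms(Y, b, eta1=0.5, n_epochs=100):
        """Widrow-Hoff / LMS rule: a <- a + eta(k) * (b_i - y_i'a) * y_i."""
        a = np.zeros(Y.shape[1])
        k = 1
        for _ in range(n_epochs):
            for y_i, b_i in zip(Y, b):
                a = a + (eta1 / k) * (b_i - y_i @ a) * y_i
                k += 1
        return a

    Y = np.array([[ 1.,  0.,  0.],
                  [ 1.,  0.,  1.],
                  [-1., -1.,  0.],
                  [-1., -1., -1.]])
    b = np.ones(4)
    print(lms(Y, b))   # approaches the MSE / pseudo-inverse solution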