Statistical classifiers: Bayesian decision theory and density estimation


3rd NOSE Short Course, Alpbach, 1st-6th Mar 2004
Statistical classifiers: Bayesian decision theory and density estimation
Ricardo Gutierrez-Osuna
Department of Computer Science
rgutier@cs.tamu.edu http://research.cs.tamu.edu/prism

Outline
Chapter 1: Review of pattern classification
Chapter 2: Review of probability theory
Chapter 3: Bayesian Decision Theory
Chapter 4: Quadratic classifiers
Chapter 5: Kernel density estimation
Chapter 6: Nearest neighbors
Chapter 7: Perceptron and least-squares classifiers

CHAPTER 1: Review of pattern classification
- Features and patterns

Features and patterns (1)
Feature: a feature is any distinctive aspect, quality or characteristic; features may be symbolic (e.g., color) or numeric (e.g., height).
- Feature vector: the combination of d features is represented as a d-dimensional column vector x = [x1 x2 ... xd]^T
- Feature space: the d-dimensional space defined by the feature vector
- Scatter plot: representation of an object collection in feature space
[Figure: the feature vector x and a scatter plot of three classes in a two-dimensional feature space]

Features and patterns (2)
Pattern: a pattern is a composite of traits or features characteristic of an individual.
In classification tasks, a pattern is a pair of variables {x, ω} where
- x is a collection of observations or features (the feature vector)
- ω is the concept behind the observation (the label)

Features and patterns (3)
What makes a good feature vector?
- The quality of a feature vector is related to its ability to discriminate examples from different classes
- Examples from the same class should have similar feature values
- Examples from different classes should have different feature values
[Figure: examples of good vs. bad features, and more feature properties: linear separability, non-linear separability, highly correlated features, multi-modal classes]

Classifiers
The task of a classifier is to partition feature space into class-labeled decision regions. Borders between decision regions are called decision boundaries. The classification of a feature vector x consists of determining which decision region it belongs to, and assigning x to that class.
In this lecture we will overview two methodologies for designing classifiers:
- Based on the underlying probability density functions of the data
- Based on geometric pattern-separability criteria
[Figure: a feature space partitioned into decision regions R1, R2, R3, R4]

CHAPTER 2: Review of probability theory
- What is a probability
- Probability density functions
- Conditional probability
- Bayes theorem
- Probabilistic reasoning: a case example

Basic probability concepts
- Probabilities are numbers assigned to events that indicate how likely it is that the event will occur when a random experiment is performed
- A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
- The sample space S of a random experiment is the set of all possible outcomes
[Figure: a probability law maps events A1, A2, A3, A4 in the sample space S to their probabilities]

Conditional probability (1)
If A and B are two events, the probability of event A when we already know that event B has occurred is defined by the relation

P[A|B] = P[A∩B] / P[B],  for P[B] > 0

This conditional probability P[A|B] is read "the conditional probability of A conditioned on B", or simply "the probability of A given B".

Conditional probability (2)
Interpretation: the new evidence "B has occurred" has the following effects
- The original sample space S (the whole square) becomes B (the rightmost circle)
- The event A becomes A∩B
- P[B] simply re-normalizes the probability of events that occur jointly with B
[Figure: Venn diagram of S, A and B, before and after B has occurred]

Theorem of total probability
Let B1, B2, ..., BN be a partition of S, a set of mutually exclusive events such that S = B1 ∪ B2 ∪ ... ∪ BN.
Any event A can then be represented as
A = A∩S = A∩(B1 ∪ B2 ∪ ... ∪ BN) = (A∩B1) ∪ (A∩B2) ∪ ... ∪ (A∩BN)
Since B1, B2, ..., BN are mutually exclusive, then, by Axiom III,
P[A] = P[A∩B1] + P[A∩B2] + ... + P[A∩BN]
and, therefore,
P[A] = P[A|B1]P[B1] + ... + P[A|BN]P[BN] = Σ_{k=1}^{N} P[A|Bk]P[Bk]

Bayes theorem
Given {B1, B2, ..., BN}, a partition of the sample space S, suppose that event A occurs; what is the probability of event Bj?
Using the definition of conditional probability and the theorem of total probability we obtain

P[Bj|A] = P[A∩Bj] / P[A] = P[A|Bj]P[Bj] / Σ_{k=1}^{N} P[A|Bk]P[Bk]

This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics.

Applying Bayes theorem (1)
Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test:
- Someone with the condition may go undetected (false negative)
- Someone free of the condition may yield a positive result (false positive)
Nomenclature
- SPECIFICITY: the true-negative rate P(NEG|¬COND) of a test
- SENSITIVITY: the true-positive rate P(POS|COND) of a test

Applying Bayes theorem (2)
PROBLEM
- Assume a population of 10,000 where 1 out of every 100 people has the medical condition
- Assume that we design a test with 98% specificity P(NEG|¬COND) and 90% sensitivity P(POS|COND)
- Assume you take the test, and it yields a POSITIVE result
What is the probability that you have the medical condition?

Applying Bayes theorem (3)
SOLUTION A: fill in the joint frequency table below. The answer is the ratio of individuals with the condition to total individuals, considering only individuals that tested positive, or 90/288 = 0.3125.

                     TEST IS POSITIVE               TEST IS NEGATIVE               ROW TOTAL
HAS CONDITION        true positives P(POS|COND)     false negatives P(NEG|COND)
                     100 × 0.90 = 90                100 × (1−0.90) = 10            100
FREE OF CONDITION    false positives P(POS|¬COND)   true negatives P(NEG|¬COND)
                     9,900 × (1−0.98) = 198         9,900 × 0.98 = 9,702           9,900
COLUMN TOTAL         288                            9,712                          10,000

Applying Bayes theorem (4)
SOLUTION B: apply Bayes theorem

P[COND|POS] = P[POS|COND] P[COND] / P[POS]
            = P[POS|COND] P[COND] / ( P[POS|COND] P[COND] + P[POS|¬COND] P[¬COND] )
            = 0.90 × 0.01 / ( 0.90 × 0.01 + (1 − 0.98) × 0.99 )
            = 0.3125
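To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that reproduces solution B:

```python
# Posterior P(COND | POS) for the screening example above.
prior = 0.01          # 1 in 100 has the condition
sensitivity = 0.90    # P(POS | COND)
specificity = 0.98    # P(NEG | not COND)

# Theorem of total probability: P(POS)
p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes theorem
posterior = sensitivity * prior / p_pos
print(posterior)      # 0.3125
```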

Bayes theorem and pattern classification
For the purpose of pattern classification, Bayes theorem is normally expressed as

P[ωj|x] = P[x|ωj] P[ωj] / Σ_{k=1}^{N} P[x|ωk] P[ωk] = P[x|ωj] P[ωj] / P[x]

where ωj is the j-th class and x is the feature vector.
Bayes theorem is relevant because, as we will see in a minute, a sensible classification rule is to choose the class ωi with the highest P[ωi|x]. This represents the intuitive rationale of choosing the class that is more likely given the observed feature vector x.

Bayes theorem and pattern classification
Each term in the Bayes theorem has a special name, which you should become familiar with:
- P[ωj]: prior probability (of class ωj)
- P[ωj|x]: posterior probability (of class ωj given the observation x)
- P[x|ωj]: likelihood (conditional probability of the observation x given class ωj)
- P[x]: a normalization constant (does not affect the decision)

CHAPTER 3: Bayesian Decision Theory
- The Likelihood Ratio Test
- The Probability of Error
- The Bayes Risk
- Bayes, MAP and ML Criteria
- Multi-class problems
- Discriminant Functions

The Likelihood Ratio Test (1)
Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x. Would you agree that a reasonable decision rule would be the following?
"Choose the class that is most probable given the observed feature vector x"
More formally: evaluate the posterior probability of each class P(ωi|x) and choose the class with the largest P(ωi|x).

The Likelihood Ratio Test (2)
Let us examine this decision rule for a 2-class problem. In this case the decision rule becomes
if P(ω1|x) > P(ω2|x) choose ω1, else choose ω2
Or, in a more compact form:
P(ω1|x) ≷ P(ω2|x)   (decide ω1 if >, ω2 if <)
Applying Bayes theorem:
P(x|ω1)P(ω1)/P(x) ≷ P(x|ω2)P(ω2)/P(x)

The Likelihood Ratio Test (3)
P(x) does not affect the decision rule so it can be eliminated*. Rearranging the previous expression:

Λ(x) = P(x|ω1) / P(x|ω2) ≷ P(ω2) / P(ω1)   (decide ω1 if >, ω2 if <)

The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test.
*P(x) can be disregarded in the decision rule since it is constant regardless of class ωi. However, P(x) will be needed if we want to estimate the posterior P(ωi|x) which, unlike P(x|ωi)P(ωi), is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision.

Likelihood Ratio Test: an example (1)
Given a classification problem with the following class-conditional densities:

P(x|ω1) = (1/√(2π)) exp(−(x−4)²/2)
P(x|ω2) = (1/√(2π)) exp(−(x−10)²/2)

Derive a classification rule based on the Likelihood Ratio Test (assume equal priors).
[Figure: the two Gaussian likelihoods, centered at x=4 and x=10]

Likelihood Ratio Test: an example (2)
Solution: substituting the given likelihoods and priors into the LRT expression:

Λ(x) = exp(−(x−4)²/2) / exp(−(x−10)²/2) ≷ 1   (decide ω1 if >, ω2 if <)

Simplifying, changing signs and taking logs:
(x−4)² − (x−10)² ≶ 0, which yields x ≶ 7   (decide ω1 if x < 7, ω2 if x > 7)
This LRT result makes intuitive sense since the likelihoods are identical and differ only in their mean value.
[Figure: decision regions R1 (say ω1) for x < 7 and R2 (say ω2) for x > 7]
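A short numerical check of this result (a sketch, not from the slides): evaluating the likelihood ratio on a grid shows the decision flipping exactly at x = 7.

```python
import numpy as np

def likelihood(x, mean):
    """Unit-variance Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(0, 14, 1401)
lrt = likelihood(x, 4) / likelihood(x, 10)   # Lambda(x); equal priors -> threshold 1
decide_w1 = lrt > 1
print(x[decide_w1].max())                    # ~6.99, just below the boundary x = 7
```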

The probability of error
The probability of error is the probability of assigning x to the wrong class. For a two-class problem, P(error|x) is simply
P(error|x) = P(ω1|x) if we decide ω2; P(ω2|x) if we decide ω1
It makes sense that the classification rule be designed to minimize the average probability of error P[error] across all possible values of x:

P(error) = ∫ P(error, x) dx = ∫ P(error|x) P(x) dx

To minimize P(error) we minimize the integrand P(error|x) at each x: choose the class with the maximum posterior P(ωi|x).
This is called the MAXIMUM A POSTERIORI (MAP) RULE.

Minimizing probability of error
We prove the optimality of the MAP rule graphically:
- The right plot shows the posterior for each of the two classes
- The bottom plots show the P(error) for the MAP rule and an alternative decision rule
- Which one has lower P(error) (color-filled area)?
[Figure: posteriors P(ωi|x) over x; decision regions for the MAP rule vs. an alternative rule ("choose RED" / "choose BLUE"), with the corresponding error areas shaded]

The Bayes Risk (1)
So far we have assumed that the penalty of misclassifying a class ω1 example as class ω2 is the same as the reciprocal. In general, this is not the case:
- For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
- Misclassifying salmon as sea bass has lower cost (unhappy customers) than the opposite error
This concept can be formalized in terms of a cost function Cij, where Cij represents the cost of choosing class ωi when class ωj is the true class.
We define the Bayes Risk as the expected value of the cost:

R = E[C] = Σ_{i=1}^{2} Σ_{j=1}^{2} Cij P[choose ωi and x∈ωj] = Σ_{i=1}^{2} Σ_{j=1}^{2} Cij P[x∈Ri|ωj] P[ωj]

The Bayes Risk (2)
What is the decision rule that minimizes the Bayes Risk? It can be shown* that the minimum risk is achieved with the following decision rule:

P(x|ω1) / P(x|ω2) ≷ (C12 − C22) P[ω2] / ( (C21 − C11) P[ω1] )   (decide ω1 if >, ω2 if <)

Notice any similarities with the LRT?
*For an intuitive proof visit my lecture notes at TAMU

The Bayes Risk: an example (1)
Consider a classification problem with two classes defined by the following likelihood functions:

P(x|ω1) = (1/√(2π)) exp(−(x−2)²/2)
P(x|ω2) = (1/√(2π·3)) exp(−x²/(2·3))

What is the decision rule that minimizes the Bayes risk?
Assume P[ω1] = P[ω2] = 0.5, C11 = C22 = 0, C21 = 1 and C12 = 3^{1/2}.
[Figure: the two likelihoods, a narrow Gaussian centered at x = 2 and a wider Gaussian centered at x = 0]

The Bayes Risk: an example (2)

Λ(x) = [ (1/√(2π)) e^{−(x−2)²/2} ] / [ (1/√(2π·3)) e^{−x²/6} ] ≷ √3   (ω1 if >, ω2 if <)
→ √3 · e^{−(x−2)²/2 + x²/6} ≷ √3
→ −(x−2)²/2 + x²/6 ≷ 0
→ x² − 6x + 6 ≶ 0
→ x = 3 ± √3 = 4.73, 1.27

Decide ω1 for 1.27 < x < 4.73, and ω2 otherwise.
[Figure: decision regions R2 (x < 1.27), R1 (1.27 < x < 4.73), R2 (x > 4.73)]
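Under the densities and costs as reconstructed above, the boundary can be checked numerically with a short sketch:

```python
import numpy as np

# Roots of x^2 - 6x + 6 = 0, the Bayes-risk decision boundary derived above
print(np.sort(np.roots([1, -6, 6])))            # [1.268, 4.732]

# Sanity check: apply the LRT with threshold C12/C21 = sqrt(3) on a grid
def p1(x): return np.exp(-(x - 2) ** 2 / 2) / np.sqrt(2 * np.pi)
def p2(x): return np.exp(-x ** 2 / 6) / np.sqrt(2 * np.pi * 3)

x = np.linspace(-6, 6, 2401)
choose_w1 = p1(x) / p2(x) > np.sqrt(3)
print(x[choose_w1].min(), x[choose_w1].max())   # ~1.27 ... ~4.73
```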

Variations of the LRT
The LRT that minimizes the Bayes risk is called the Bayes criterion:

Λ(x) = P(x|ω1)/P(x|ω2) ≷ (C12 − C22)P[ω2] / ((C21 − C11)P[ω1])   (Bayes criterion)

Many times we will simply be interested in minimizing P[error], which is a special case of the Bayes criterion if we use a zero-one cost function:
Cij = 0 if i = j; 1 if i ≠ j
This version of the LRT is referred to as the Maximum A Posteriori (MAP) criterion, since it seeks to maximize the posterior P(ωi|x):

Λ(x) = P(x|ω1)/P(x|ω2) ≷ P(ω2)/P(ω1)   (MAP criterion)

Finally, for the case of equal priors P[ωi] = 1/2 and a zero-one cost function, the LRT is called the Maximum Likelihood (ML) criterion, since it will maximize the likelihood P(x|ωi):

Λ(x) = P(x|ω1)/P(x|ω2) ≷ 1   (ML criterion)

Multi-class problems
The previous decisions were derived for two-class problems, but generalize gracefully to multiple classes:
- To minimize P[error], choose the class ωi with the highest posterior:
  ωi = argmax_{i=1..C} P(ωi|x)
- To minimize the Bayes risk, choose the class ωi with the lowest risk:
  ωi = argmin_{i=1..C} R(ωi|x) = argmin_{i=1..C} Σ_{j=1}^{C} Cij P(ωj|x)

Discriminant functions (1)
Note that all the decision rules have the same structure: at each point x in feature space, choose the class ωi which maximizes (or minimizes) some measure gi(x).
This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the following decision rule:
"assign x to class ωi if gi(x) > gj(x) for all j ≠ i"
We can then express the three basic decision rules (Bayes, MAP and ML) in terms of discriminant functions:

Criterion   Discriminant function
Bayes       gi(x) = −R(αi|x)
MAP         gi(x) = P(ωi|x)
ML          gi(x) = P(x|ωi)

Discriminant functions (2)
Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the category corresponding to the largest discriminant.
[Figure: a network with features x1, x2, x3, ..., xd at the bottom, discriminant functions g1(x), g2(x), ..., gC(x) and costs above them, a "select max" stage, and the class assignment at the top]

Recapping
The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods P[x|ωi]:
- P[x|ωi] is generally unknown, but can be estimated from data
- If the form of the likelihood is known (e.g., Gaussian), the problem is simplified because we only need to estimate the parameters of the model (e.g., mean and covariance)
  - This leads to a classifier known as QUADRATIC, which we cover next
- If the form of the likelihood is unknown, the problem becomes much harder, and requires a technique known as non-parametric density estimation
  - This technique is covered in the final chapters of this lecture

CHAPTER 4: Quadratic classifiers
- Bayes classifiers for normally distributed classes
- The Euclidean-distance classifier
- The Mahalanobis-distance classifier
- Numerical example

The Normal or Gaussian distribution
Remember that the univariate Normal distribution N(µ,σ²) is

f_X(x) = (1/(√(2π)σ)) exp(−(x−µ)²/(2σ²))

Similarly, the multivariate Normal distribution N(µ,Σ) is defined as

f_X(x) = (1/((2π)^{n/2}|Σ|^{1/2})) exp(−½ (x−µ)^T Σ^{−1} (x−µ))

Gaussian pdfs are very popular since:
- The parameters (µ,Σ) are sufficient to uniquely characterize the pdf
- If the xi's are mutually uncorrelated (c_ik = 0), then they are also independent
  - The covariance matrix becomes a diagonal matrix, with the individual variances on the main diagonal
[Figure: two univariate Gaussian pdfs with different means and variances, and the surface of a bivariate Gaussian]

Covariance matrix
The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary*.
The covariance has several important properties:
- If xi and xk tend to increase together, then c_ik > 0
- If xi tends to decrease when xk increases, then c_ik < 0
- If xi and xk are uncorrelated, then c_ik = 0
- |c_ik| ≤ σi σk, where σi is the standard deviation of xi
- c_ii = σi² = VAR(xi)
The covariance terms can be expressed as c_ii = σi² and c_ik = ρ_ik σi σk, where ρ_ik is called the correlation coefficient.
[Figure: scatter plots of (xi, xk) for C_ik = −σiσk (ρ_ik = −1), C_ik = −½σiσk (ρ_ik = −½), C_ik = 0 (ρ_ik = 0), C_ik = +½σiσk (ρ_ik = +½) and C_ik = σiσk (ρ_ik = +1)]
*from http://www.engr.sjsu.edu/~knapp/hcirodpr/pr_home.htm

Bayes classifier for Gaussian classes (1)
For Normally distributed classes, the discriminant functions can be reduced to very simple expressions. The (multivariate) Gaussian density can be defined as

p(x) = (1/((2π)^{n/2}|Σ|^{1/2})) exp(−½ (x−µ)^T Σ^{−1} (x−µ))

Using Bayes rule, the MAP discriminant function can be written as

gi(x) = P(ωi|x) = P(x|ωi)P(ωi)/P(x) = (1/((2π)^{n/2}|Σi|^{1/2})) exp(−½ (x−µi)^T Σi^{−1} (x−µi)) · P(ωi)/P(x)

Bayes classifier for Gaussian classes (2)
Eliminating constant terms:

gi(x) = |Σi|^{−1/2} exp(−½ (x−µi)^T Σi^{−1} (x−µi)) P(ωi)

Taking logs:

gi(x) = −½ (x−µi)^T Σi^{−1} (x−µi) − ½ log|Σi| + log(P(ωi))

This is known as a QUADRATIC discriminant function (because it is a function of the square of x). In the next few slides we will analyze what happens to this expression under different assumptions for the covariance.

Case 1: Σi = σ²I (1)
This situation occurs when the features are statistically independent, and have the same variance for all classes. In this case, the quadratic discriminant function becomes

gi(x) = −½ (x−µi)^T (σ²I)^{−1} (x−µi) − ½ log(|σ²I|) + log(P(ωi)) = −(1/(2σ²)) (x−µi)^T (x−µi) + log(P(ωi)) + const

Assuming equal priors and dropping constant terms:

gi(x) = −(x−µi)^T (x−µi) = −Σ_{d=1}^{DIM} (x_d − µ_{i,d})²

This is called a Euclidean-distance or nearest-mean classifier. From [Schalkoff, 1992]

Case 1: Σi = σ²I (2)
This is probably the simplest statistical classifier that you can build: assign an unknown example to the class whose center is the closest using the Euclidean distance.
[Figure: the unknown x is compared against the class centers µ1, µ2, ..., µC by Euclidean-distance blocks feeding a minimum selector, which outputs the class]
How valid is the assumption Σi = σ²I in chemical sensor arrays?
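A minimal nearest-mean classifier in Python (an illustrative sketch; in practice the class centers would be the sample means of the training data):

```python
import numpy as np

def nearest_mean(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    dists = np.linalg.norm(means - x, axis=1)   # one distance per class
    return int(np.argmin(dists))

means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])  # toy class centers
print(nearest_mean(np.array([2.5, 2.0]), means))        # -> 1
```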

Case 1: Σi = σ²I, example
[Example: three Gaussian classes with equal spherical covariance matrices Σ1 = Σ2 = Σ3 = σ²I; the nearest-mean classifier produces linear decision boundaries that bisect the segments between the class means]

Case 2: Σi = Σ (Σ non-diagonal)
All the classes have the same covariance matrix, but the matrix is not diagonal. In this case, the quadratic discriminant becomes

gi(x) = −½ (x−µi)^T Σ^{−1} (x−µi) − ½ log|Σ| + log(P(ωi))

Assuming equal priors and eliminating constant terms:

gi(x) = −(x−µi)^T Σ^{−1} (x−µi)

This is known as a Mahalanobis-distance classifier.
[Figure: the unknown x is compared against the class centers µ1, µ2, ..., µC by Mahalanobis-distance blocks feeding a minimum selector, which outputs the class]

The Mahalanobis distance
The quadratic term is called the Mahalanobis distance, a very important metric in statistical pattern recognition (right up there with the Bayes theorem). The Mahalanobis distance is a vector distance that uses a Σ^{−1} norm; Σ^{−1} can be thought of as a stretching factor on the space. Note that for an identity covariance matrix (Σ = I), the Mahalanobis distance becomes the familiar Euclidean distance.
[Figure: the ellipse of constant Mahalanobis distance K around µ vs. the circle of constant Euclidean distance K]
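The same idea with a shared covariance matrix gives the Mahalanobis-distance classifier; a minimal sketch assuming Σ is known (in practice it is estimated from the pooled training data):

```python
import numpy as np

def mahalanobis_classify(x, means, cov):
    """Assign x to the class with the smallest Mahalanobis distance."""
    cov_inv = np.linalg.inv(cov)
    dists = [(x - m) @ cov_inv @ (x - m) for m in means]
    return int(np.argmin(dists))

means = np.array([[0.0, 0.0], [3.0, 3.0]])
cov = np.array([[1.0, 0.7], [0.7, 1.0]])    # common, non-diagonal covariance
print(mahalanobis_classify(np.array([1.0, 2.5]), means, cov))   # -> 1
```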

Case 2: Σi = Σ (Σ non-diagonal), example
[Example: three Gaussian classes sharing the common non-diagonal covariance Σ = [1 0.7; 0.7 1]; the Mahalanobis-distance classifier again yields linear decision boundaries, now tilted to account for the feature correlation]

Case 3: Σi ≠ Σj (general case), example
[Example: three Gaussian classes with different covariance matrices; the resulting decision boundaries are quadric surfaces. A zoomed-out view of the boundaries is also shown]

Numerical example (1)
Derive a linear discriminant function for the two-class 3D classification problem defined by

µ1 = [0 0 0]^T;  µ2 = [1 1 1]^T;  Σ1 = Σ2 = [1/4 0 0; 0 1/4 0; 0 0 1/4];  P(ω2) = 2 P(ω1)

Would anybody dare to sketch the likelihood densities and decision boundary for this problem?

Numerical example (2)
Solution: start from the quadratic discriminant

gi(x) = −½ (x−µi)^T Σ^{−1} (x−µi) − ½ log|Σ| + log P(ωi)

With Σ^{−1} = 4I, and dropping the −½ log|Σ| term, which is common to both classes:

g1(x) = −½ [x y z] (4I) [x y z]^T + log(1/3) = −2(x² + y² + z²) + log(1/3)
g2(x) = −½ [x−1 y−1 z−1] (4I) [x−1 y−1 z−1]^T + log(2/3) = −2((x−1)² + (y−1)² + (z−1)²) + log(2/3)

Numerical example (3)
Solution (continued): choosing ω1 when g1(x) > g2(x) and expanding the quadratic terms:

−2(x² + y² + z²) + log(1/3) ≷ −2((x−1)² + (y−1)² + (z−1)²) + log(2/3)

which simplifies to the linear boundary

x + y + z ≶ (6 − log 2)/4 = 1.32   (ω1 if <, ω2 if >)

Classify the test example xu = [0.1 0.7 0.8]^T:
0.1 + 0.7 + 0.8 = 1.6 > 1.32 → xu ∈ ω2
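The worked example can be verified with a few lines (a sketch following the numbers above):

```python
import numpy as np

mu1, mu2 = np.zeros(3), np.ones(3)
cov_inv = 4 * np.eye(3)                  # Sigma = I/4 for both classes
p1, p2 = 1 / 3, 2 / 3                    # P(w2) = 2 * P(w1)

def g(x, mu, prior):
    """Quadratic discriminant (common log|Sigma| term dropped)."""
    return -0.5 * (x - mu) @ cov_inv @ (x - mu) + np.log(prior)

xu = np.array([0.1, 0.7, 0.8])
print(g(xu, mu1, p1) > g(xu, mu2, p2))   # False -> choose omega_2
print((6 - np.log(2)) / 4)               # linear threshold on x+y+z: ~1.327
```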

Conclusions
- The Euclidean-distance classifier is Bayes-optimal* for Gaussian classes with equal covariance matrices proportional to the identity matrix and equal priors
- The Mahalanobis-distance classifier is Bayes-optimal for Gaussian classes with equal covariance matrices and equal priors
*Bayes-optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve

CHAPTER 5: Kernel Density Estimation
- Histograms
- Parzen Windows
- Smooth Kernels
- The Naïve Bayes Classifier

Non-parametric density estimation (NPDE)
In the previous two chapters we have assumed that either
- The likelihoods p(x|ωi) were known (Likelihood Ratio Test), or
- At least the parametric form of the likelihoods was known (parameter estimation)
The methods that will be presented in the next two chapters do not afford such luxuries. Instead, they attempt to estimate the density directly from the data without making assumptions about the underlying distribution. Sounds challenging? You bet!

The histogram
The simplest form of NPDE is the familiar histogram: divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin:

P_H(x) = (1/N) · [number of x^(n) in the same bin as x] / [width of the bin containing x]

[Figure: a histogram density estimate of a univariate dataset]
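A histogram density estimate in a few lines of Python (a sketch; `np.histogram` with `density=True` performs exactly this normalization by N and the bin width):

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=500)

# density=True divides the bin counts by N and by the bin width,
# i.e., P_H(x) = k / (N * bin_width)
density, edges = np.histogram(data, bins=16, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(centers[np.argmax(density)])   # bin center near the true mean (10)
```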

Shortcomings of the histogram
- The shape of the NPDE depends on the starting position of the bins
  - For multivariate data, the final shape of the NPDE also depends on the orientation of the bins
- The discontinuities are not due to the underlying density; they are only an artifact of the chosen bin locations
  - These discontinuities make it very difficult, without experience, to grasp the structure of the data
- A much more serious problem is the curse of dimensionality: the number of bins grows exponentially with the number of dimensions
  - In high dimensions we would require a very large number of examples, or else most of the bins would be empty
- All these drawbacks make the histogram unsuitable for most practical applications, except for rapid visualization of results in one or two dimensions

NPDE, general formulation (1)
Let us return to the basic definition of probability to get a solid idea of what we are trying to accomplish. The probability that a vector x, drawn from a distribution p(x), will fall in a given region R of the sample space is

P = ∫_R p(x') dx'

[Figure: a univariate density p(x) with the region R shaded] From [Bishop, 1995]

NPDE, general formulation (2)
Suppose now that N vectors {x^(1), x^(2), ..., x^(N)} are drawn from the distribution; the probability that k of these N vectors fall in R is given by the binomial distribution:

Prob[k] = (N choose k) P^k (1−P)^{N−k}

It can be shown (from the properties of the binomial) that the mean and variance of the ratio k/N are

E[k/N] = P  and  Var[k/N] = E[(k/N − P)²] = P(1−P)/N

Note that the variance gets smaller as N→∞, so we can expect that a good estimate of P is the mean fraction of points that fall within R:

P ≅ k/N

From [Bishop, 1995]

NPDE, general formulation (3)
Assume now that R is so small that p(x) does not vary appreciably within it; then the integral can be approximated by

∫_R p(x') dx' ≅ p(x)·V

where V is the volume enclosed by region R.
[Figure: a univariate density with a small region R of volume V] From [Bishop, 1995]

NPDE, general formulation (4)
Merging the two expressions we obtain

P = ∫_R p(x') dx' ≅ p(x)·V  and  P ≅ k/N  →  p(x) ≅ k/(NV)

This estimate becomes more accurate as we increase the number of sample points N and shrink the volume V:
- In practice the value of N (the total number of examples) is fixed
- To improve the estimate p(x) we could let V approach zero, but then region R would become so small that it would enclose no examples
- This means that, in practice, we will have to find a compromise value for the volume V:
  - Large enough to include enough examples within R
  - Small enough to support the assumption that p(x) is constant within R
From [Bishop, 1995]

NPDE, general formulation (5)
In conclusion, the general expression for NPDE is

p(x) ≅ k/(NV)

where V is the volume surrounding x, N is the total number of examples, and k is the number of examples inside V.
When applying this result to practical density estimation problems, two basic approaches can be adopted:
- Kernel Density Estimation (KDE): choose a fixed value of the volume V and determine k from the data
- k Nearest Neighbor (kNN): choose a fixed value of k and determine the corresponding volume V from the data
It can be shown that both KDE and kNN converge to the true probability density as N→∞, provided that V shrinks with N, and k grows with N, appropriately. From [Bishop, 1995]

Parzen windows (1)
Suppose that the region R that encloses the k examples is a hypercube of side h; then its volume is given by V = h^D, where D is the number of dimensions.
To find the number of examples that fall within this region we define a kernel function K(u):

K(u) = 1 if |u_j| < 1/2 for all j = 1..D; 0 otherwise

This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator.
[Figure: a hypercube of side h around x, and the unit Parzen window K(u)] From [Bishop, 1995]

Parzen windows (2)
The total number of points inside the hypercube is then

k = Σ_{n=1}^{N} K((x − x^(n))/h)

Substituting back into the density estimate expression:

p_KDE(x) = (1/(N h^D)) Σ_{n=1}^{N} K((x − x^(n))/h)

Note that the Parzen window density estimate resembles the histogram, with the exception that the bin locations are determined by the data points.
[Figure: points x^(1)...x^(4) around x; K(x−x^(1)) = K(x−x^(2)) = K(x−x^(3)) = 1 and K(x−x^(4)) = 0] From [Bishop, 1995]

Numerical exercise (1)
Given the dataset X below, use Parzen windows to estimate the density p(x) at y = 3, 10, 15:

X = {x^(1), x^(2), ..., x^(N)} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}

Use a bandwidth of h = 4.
[Figure: the ten data points on the x axis, with the estimation points y = 3, y = 10 and y = 15 marked]

Numerical exercise (2)
Solution: let's first estimate p(y=3):

p_KDE(y=3) = (1/(N h^D)) Σ_{n=1}^{N} K((y − x^(n))/h)
           = (1/(10·4)) [ K(−1/4) + K(−1/2) + K(−1/2) + K(−3/4) + ... + K(−7/2) ]
           = (1/40) [1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0] = 1/40 = 0.025

Similarly:

p_KDE(y=10) = (1/40) [0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0] = 0
p_KDE(y=15) = (1/40) [0 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0] = 4/40 = 0.1
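The exercise can be reproduced with a direct implementation of the Parzen estimate (a sketch using the dataset as reconstructed above):

```python
import numpy as np

def parzen(y, data, h):
    """p_KDE(y) with a unit-hypercube (Parzen) kernel in one dimension."""
    u = (y - data) / h
    k = np.sum(np.abs(u) < 0.5)          # points inside the window
    return k / (len(data) * h)

X = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
for y in (3, 10, 15):
    print(y, parzen(y, X, h=4))          # 0.025, 0.0, 0.1
```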

Smooth kernels (1)
The Parzen window has several drawbacks:
- It yields density estimates that have discontinuities
- It weights equally all the points x_i, regardless of their distance to the estimation point x
Some of these difficulties can be overcome by replacing the Parzen window with a smooth kernel K(u) such that

∫_{R^D} K(x) dx = 1

[Figure: the rectangular Parzen window vs. a smooth kernel, both with unit area A = 1]

Smooth kernels (2)
Usually, but not always, K(u) will be a radially symmetric and unimodal probability density function, such as the multivariate Gaussian density function

K(x) = (1/(2π)^{D/2}) exp(−x^T x / 2)

where the expression of the density estimate remains the same as with Parzen windows:

p_KDE(x) = (1/(N h^D)) Σ_{n=1}^{N} K((x − x^(n))/h)

Smooth kernels (3)
Just as the Parzen window estimate can be considered a sum of boxes centered at the observations, the smooth kernel estimate is a sum of "bumps" placed at the data points. The kernel function determines the shape of the bumps. The parameter h, also called the smoothing parameter or bandwidth, determines their width.
[Figure: a kernel density estimate with h = 3, shown as the sum of Gaussian kernel functions placed at the data points]

Bandwidth selection, univariate case (1)
[Figure: kernel density estimates of the same dataset for increasing bandwidths, roughly h = 1.0, 2.5, 5.0 and 10.0; a small h produces a spiky, under-smoothed estimate, while a large h over-smooths the density]

Bandwidth selection, univariate case (2)
Subjective choice
- Plot out several curves and choose the estimate that is most in accordance with one's prior (subjective) ideas
- However, this method is not practical in pattern recognition since we typically have high-dimensional data
Reference to a standard distribution
- Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated square error (MISE):

h_opt = argmin_h { MISE(p_KDE(x)) } = argmin_h { E[ ∫ (p_KDE(x) − p(x))² dx ] }

- If we assume that the true distribution is Gaussian and we use a Gaussian kernel, it can be shown that the optimal bandwidth is

h_opt = 1.06 σ N^{−1/5}

where σ is the sample standard deviation and N is the number of training examples. From [Silverman, 1986]
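A Gaussian-kernel KDE with the rule-of-thumb bandwidth above, as a small sketch (the helper name `gaussian_kde` is illustrative, not from the slides):

```python
import numpy as np

def gaussian_kde(y, data, h):
    """Univariate KDE with a Gaussian kernel and bandwidth h."""
    u = (y - data[:, None]) / h
    return np.mean(np.exp(-0.5 * u ** 2), axis=0) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=200)

h_opt = 1.06 * data.std() * len(data) ** (-1 / 5)   # rule-of-thumb bandwidth
grid = np.linspace(-4, 4, 201)
p = gaussian_kde(grid, data, h_opt)
print(h_opt, p[100])    # bandwidth and the estimate near x=0 (~0.4 for N(0,1))
```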

Bandwidth selection, univariate case (3)
Likelihood cross-validation
- The ML estimate of h is degenerate since it yields h_ML = 0, a density estimate with Dirac delta functions at each training data point
- A practical alternative is to maximize the pseudo-likelihood computed using leave-one-out cross-validation:

h_MLCV = argmax_h { (1/N) Σ_{n=1}^{N} log p_{−n}(x^(n)) }  where  p_{−n}(x^(n)) = (1/((N−1)h)) Σ_{m=1, m≠n}^{N} K((x^(n) − x^(m))/h)

[Figure: leave-one-out density estimates p_{−n}(x) evaluated at the held-out points x^(n)] From [Silverman, 1986]
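Likelihood cross-validation can be sketched directly from the formula (a minimal version assuming a univariate Gaussian kernel; `loo_log_likelihood` is an illustrative name):

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out pseudo-log-likelihood of bandwidth h."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h       # pairwise differences
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)                      # leave x^(n) out of p_-n
    p_loo = k.sum(axis=1) / ((n - 1) * h)
    return np.mean(np.log(p_loo))

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=100)
hs = np.linspace(0.05, 2.0, 40)
h_mlcv = hs[np.argmax([loo_log_likelihood(data, h) for h in hs])]
print(h_mlcv)
```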

Multivariate density estimation
- The bandwidth needs to be selected individually for each axis
  - Alternatively, one may pre-scale the axes or whiten the data, so that the same bandwidth can be used for all dimensions
- The density can be estimated with a multivariate kernel or by means of so-called product kernels (see TAMU notes)
[Figure: a two-dimensional dataset and the product-kernel density estimate P(x1, x2|ωi)]

Naïve Bayes classifier (1)
How do we apply KDE to classifier design?
- First, we estimate the likelihood of each class P(x|ωi)
- Then we apply Bayes rule to derive the MAP rule: gi(x) = P(ωi|x) ∝ P(x|ωi)P(ωi)
However, P(x|ωi) is multivariate: NPDE becomes hard!!
To avoid this problem, one practical simplification is sometimes made: assume that the features are class-conditionally independent:

P(x|ωi) = Π_{d=1}^{D} P(x(d)|ωi)

Naïve Bayes classifier (2)
Class-conditional independence vs. independence:
- Class-conditional independence: P(x|ωi) = Π_{d=1}^{D} P(x(d)|ωi)
- Independence: P(x) = Π_{d=1}^{D} P(x(d))
[Figure: two-dimensional examples contrasting datasets whose features are class-conditionally independent with datasets whose features are (unconditionally) independent]

Naïve Bayes classifier (3)
Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier:

g_{i,NB}(x) = P(ωi) Π_{d=1}^{D} P(x(d)|ωi)   (Naïve Bayes classifier)

The main advantage of the Naïve Bayes classifier is that we only need to compute the univariate densities P(x(d)|ωi), which is a much easier problem than estimating the multivariate density P(x|ωi). Despite its simplicity, the Naïve Bayes classifier has been shown to have comparable performance to artificial neural networks and decision tree learning in some domains.
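A Naïve Bayes classifier built from per-feature kernel density estimates, as a compact sketch (the helper names `kde1d` and `naive_bayes` and the fixed bandwidth are illustrative assumptions):

```python
import numpy as np

def kde1d(y, data, h=0.5):
    """Univariate Gaussian-kernel density estimate evaluated at y."""
    u = (y - data) / h
    return np.mean(np.exp(-0.5 * u ** 2)) / (h * np.sqrt(2 * np.pi))

def naive_bayes(x, class_data, priors):
    """g_i(x) = P(w_i) * prod_d P(x(d)|w_i), each factor via univariate KDE."""
    scores = []
    for data, prior in zip(class_data, priors):
        lik = np.prod([kde1d(x[d], data[:, d]) for d in range(len(x))])
        scores.append(prior * lik)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
c0 = rng.normal([0, 0], 1.0, size=(50, 2))    # toy training data, class 0
c1 = rng.normal([3, 3], 1.0, size=(50, 2))    # toy training data, class 1
print(naive_bayes(np.array([2.5, 2.8]), [c0, c1], [0.5, 0.5]))   # -> 1
```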

CHAPTER 6: Nearest Neighbors
- Nearest-neighbor density estimation
- The k-nearest-neighbor classification rule
- kNN as a lazy learner
- Characteristics of the kNN classifier
- Optimizing the kNN classifier

kNN Density Estimation (1)
In the kNN method we grow the volume surrounding the estimation point x until it encloses a total of k data points. The density estimate then becomes

p(x) ≅ k/(NV) = k / (N · c_D · R_k^D(x))

- R_k(x) is the distance between the estimation point x and its k-th closest neighbor
- c_D is the volume of the unit sphere in D dimensions: c_D = π^{D/2} / (D/2)! = π^{D/2} / Γ(D/2 + 1)
  - Thus c_1 = 2, c_2 = π, c_3 = 4π/3 and so on
[Figure: in two dimensions, a circle of radius R around x encloses k points, so Vol = πR² and p(x) = k/(NπR²)]
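A direct implementation of this estimate in one dimension (a sketch):

```python
import numpy as np

def knn_density(y, data, k):
    """kNN density estimate in 1D: p(y) ~ k / (N * 2 * R_k(y))."""
    r_k = np.sort(np.abs(data - y))[k - 1]   # distance to the k-th neighbor
    volume = 2 * r_k                         # c_1 = 2 in one dimension
    return k / (len(data) * volume)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=500)
print(knn_density(0.0, data, k=25))          # should be near 0.4 for N(0,1)
```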

kNN Density Estimation (2)
In general, the estimates that can be obtained with the kNN method are not very satisfactory:
- The estimates are prone to local noise
- The method produces estimates with very heavy tails
- Since the function R_k(x) is not differentiable, the density estimate will have discontinuities
These properties are illustrated in the next few slides.

kNN Density Estimation, example
To illustrate kNN we generated several density estimates for a univariate mixture of two Gaussians, P(x) = ½N(0,1) + ½N(10,4), and several values of N and k.
[Figure: kNN density estimates of the mixture for the different (N, k) combinations]

kNN Density Estimation, example 2 (a)
The performance of the kNN density estimation technique in two dimensions is illustrated in these figures:
- The top figure shows the true density, a mixture of two bivariate Gaussians p(x) = ½N(µ1,Σ1) + ½N(µ2,Σ2) with µ1 = [0 5]^T and µ2 = [5 0]^T
- The bottom figure shows the density estimate for k = 10 neighbors and N = 200 examples
In the next slide we show the contours of the two distributions overlapped with the training data used to generate the estimate.

kNN Density Estimation, example 2 (b)
[Figure: contours of the true density (left) and of the kNN density estimate (right), overlapped with the training data]

kNN as a Bayesian classifier (1)
The main advantage of the kNN method is that it leads to a very simple approximation of the Bayes classifier. Assume that we have a dataset with N examples, Ni from class ωi, and that we are interested in classifying an unknown sample xu:
- We draw a hyper-sphere of volume V around xu; assume this volume contains a total of k examples, ki from class ωi
- The unconditional density is estimated by P(x) = k/(NV)
From [Bishop, 1995]

kNN as a Bayesian classifier (2)
Similarly, we can approximate the likelihood functions by counting the number of examples of each class inside volume V:

P(x|ωi) = ki / (Ni·V)

And the priors are approximated by P(ωi) = Ni/N.
Putting everything together, the Bayes classifier becomes

P(ωi|x) = P(x|ωi)P(ωi)/P(x) = [ki/(NiV)]·[Ni/N] / [k/(NV)] = ki/k

From [Bishop, 1995]

The kNN classification rule (1)
The k Nearest Neighbor rule (kNN) is a very intuitive method that classifies unlabeled examples based on their similarity to examples in the training set: for a given unlabeled example xu ∈ R^D, find the k closest labeled examples in the training data set and assign xu to the class that appears most frequently within the k-subset.
The kNN only requires:
- An integer k
- A set of labeled examples (training data)
- A metric to measure "closeness"

The kNN classification rule (2)
Example: in the example below we have three classes; the goal is to find a class label for the unknown example xu. In this case we use the Euclidean distance and a value of k = 5 neighbors. Of the 5 closest neighbors, 4 belong to ω1 and 1 belongs to ω3, so xu is assigned to ω1, the predominant class.
[Figure: the unknown xu surrounded by its 5 nearest neighbors from classes ω1, ω2 and ω3]
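The kNN rule itself is a few lines of code (a sketch with the Euclidean metric):

```python
import numpy as np

def knn_classify(xu, X, labels, k=5):
    """Vote among the k nearest training examples (Euclidean metric)."""
    dists = np.linalg.norm(X - xu, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return int(np.bincount(nearest).argmax())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([3.5, 3.0]), X, labels, k=5))   # -> 1
```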

kNN in action: example 1
We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are multi-modal and non-linearly separable, as illustrated in the figure. We used the kNN rule with
- k = 5
- the Euclidean distance as a metric
The resulting decision boundaries and decision regions are shown below.
[Figure: the training data and the kNN decision regions]

kNN in action: example 2
We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are unimodal and distributed in rings around a common mean. These classes are also non-linearly separable, as illustrated in the figure. We used the kNN rule with
- k = 5
- the Euclidean distance as a metric
The resulting decision boundaries and decision regions are shown below.
[Figure: the training data and the kNN decision regions]

Characteristics of the kNN classifier (1)
Advantages
- Simple implementation
- Nearly optimal in the large sample limit (N→∞): P[error]_Bayes < P[error]_1NN < 2·P[error]_Bayes
- Uses local information, which can yield highly adaptive behavior
- Lends itself very easily to parallel implementations
Disadvantages
- Large storage requirements
- Computationally intensive recall
- Highly susceptible to the curse of dimensionality

Characteristics of the kNN classifier (2)
1NN versus kNN: the use of large values of k has two main advantages
- Yields smoother decision regions
- Provides probabilistic information: the ratio of examples for each class gives information about the ambiguity of the decision
However, too large a value of k is detrimental:
- It destroys the locality of the estimation, since farther examples are taken into consideration
- In addition, it increases the computational burden

kNN versus 1NN
[Figure: decision regions for increasing values of k (1-NN, 5-NN, ...); larger k produces smoother boundaries]

kNN and the problem of feature weighting
[Figure: a kNN example in which a noisy feature distorts the distance metric and degrades the decision regions]

Feature weighting
The previous example illustrated the Achilles heel of the kNN classifier: its sensitivity to noisy axes. A possible solution would be to normalize each feature to N(0,1). However, normalization does not resolve the curse of dimensionality; a close look at the Euclidean distance shows that this metric can become very noisy for high-dimensional problems if only a few of the features carry the classification information:

d(xu, x) = √( Σ_{k=1}^{D} (xu(k) − x(k))² )

The solution to this problem is to modify the Euclidean metric by a set of weights that represent the information content or "goodness" of each feature.

CHAPTER 7: Linear Discriminant Functions
- Perceptron learning
- Minimum squared error (MSE) solution
- Least-mean-squares (LMS) rule

Linear Discriminant Functions (1)
The objective of this chapter is to present methods for learning linear discriminant functions of the form

g(x) = w^T x + w0, with g(x) > 0 → x ∈ ω1 and g(x) < 0 → x ∈ ω2

where w is the weight vector and w0 is the threshold weight or bias. Similar discriminant functions were derived in chapter 4 as a special case of the quadratic classifier. In this chapter, the discriminant functions will be derived in a non-parametric fashion, that is, no assumptions will be made about the underlying densities.
[Figure: a linear decision boundary w^T x + w0 = 0 in a two-dimensional feature space, with w^T x + w0 > 0 on one side and w^T x + w0 < 0 on the other, and w normal to the boundary]

Linear Discriminant Functions (2)
For convenience, we will focus on binary classification. Extension to the multi-category case can be easily achieved by:
- Using ωi / not-ωi dichotomies
- Using ωi / ωj dichotomies

Gradient descent (1)
Gradient descent is a general method for function minimization. From basic calculus, we know that the minimum of a function J(x) is defined by the zeros of the gradient:

∇_x J(x) = 0  →  x* = argmin_x J(x)

- Only in very special cases does this minimization problem have a closed-form solution
- In some other cases, a closed-form solution may exist, but is numerically ill-posed or impractical (e.g., memory requirements)

Gradient descent (2)
Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent:
1. Start with an arbitrary solution x(0)
2. Compute the gradient ∇_x J(x(k))
3. Move in the direction of steepest descent: x(k+1) = x(k) − η ∇_x J(x(k)), where η is the learning rate
4. Go to 2 (until convergence)
[Figure: a one-dimensional objective J(w), where ∇J < 0 implies a positive step and ∇J > 0 a negative step, and a two-dimensional surface where an initial guess descends into a local minimum rather than the global minimum]
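A generic gradient-descent loop matching the steps above (a sketch minimizing a simple quadratic):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, steps=100):
    """x(k+1) = x(k) - eta * grad J(x(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Minimize J(x) = (x1 - 3)^2 + (x2 + 1)^2, whose gradient is 2(x - [3, -1])
grad = lambda x: 2 * (x - np.array([3.0, -1.0]))
print(gradient_descent(grad, [0.0, 0.0]))   # -> close to [3, -1]
```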

Perceptron learning (1)
Let's now consider the problem of learning a binary classification problem with a linear discriminant function. As usual, assume we have a dataset X = {x^(1), x^(2), ..., x^(N)} containing examples from the two classes. For convenience, we will absorb the intercept w0 by augmenting the feature vector x with an additional constant dimension:

w^T x + w0 = [w0 w^T] [1; x] = a^T y

From [Duda, Hart and Stork, 2001]

Perceptron learning (2)
Keep in mind that our objective is to find a vector a such that

g(x) = a^T y > 0 for x ∈ ω1, and < 0 for x ∈ ω2

To simplify the derivation, we will "normalize" the training set by replacing all examples from class ω2 by their negatives:

y ← −y, ∀ y ∈ ω2

This allows us to ignore class labels and look for a weight vector such that

a^T y > 0, ∀ y

From [Duda, Hart and Stork, 2001]

Perceptron learning (3)
To find this solution we must first define an objective function J(a). A good choice is what is known as the Perceptron criterion:

J_P(a) = Σ_{y ∈ Y_M} (−a^T y)

where Y_M is the set of examples misclassified by a. Note that J_P(a) is non-negative since a^T y < 0 for misclassified samples.

Perceptron learning (4)
To find the minimum of J_P(a), we use gradient descent. The gradient is defined by

∇_a J_P(a) = Σ_{y ∈ Y_M} (−y)

and the gradient descent update rule becomes

a(k+1) = a(k) + η Σ_{y ∈ Y_M} y

This is known as the perceptron batch update rule. The weight vector may also be updated in an on-line fashion, that is, after the presentation of each individual example:

a(k+1) = a(k) + η y^(i)   (perceptron rule)

where y^(i) is an example that has been misclassified by a(k).
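The on-line perceptron rule in Python (a sketch; it assumes the class-2 examples have already been augmented and negated, as in the "normalization" above):

```python
import numpy as np

def perceptron(Y, eta=1.0, epochs=100):
    """On-line perceptron. Y holds augmented, 'normalized' examples
    (class-2 rows already multiplied by -1), so we need a @ y > 0 for all y."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        updated = False
        for y in Y:
            if a @ y <= 0:          # misclassified example
                a = a + eta * y     # a(k+1) = a(k) + eta * y
                updated = True
        if not updated:             # converged: a @ y > 0 for every y
            break
    return a

# Toy separable data: class 1 at (2,2),(3,1); class 2 at (-1,-2),(-2,-1)
Y = np.array([[1, 2, 2], [1, 3, 1], [-1, 1, 2], [-1, 2, 1]], dtype=float)
print(perceptron(Y))
```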

Perceptron learning (5)
- If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution
- However, if the two classes are not linearly separable, the perceptron rule will not converge
  - Since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections in the perceptron rule will never cease
  - One ad-hoc solution to this problem is to enforce convergence by using variable learning rates η(k) that approach zero as k approaches infinity

Minimum Squared Error solution (1)
The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule:
- The perceptron rule seeks a weight vector a that satisfies the inequality a^T y^(i) > 0; the perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality
- Instead, the MSE criterion looks for a solution to the equality a^T y^(i) = b^(i), where b^(i) are some pre-specified target values (e.g., class labels); as a result, the MSE solution uses ALL of the samples in the training set
From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (2)
The system of equations solved by MSE is

Ya = b

where row n of Y is the augmented training example [y_0^(n) y_1^(n) ... y_D^(n)], a = [a_0 a_1 ... a_D]^T is the weight vector, and each row of b is the corresponding class label b^(n). For consistency, we will continue assuming that examples from class ω2 have been replaced by their negative vectors, although this is not a requirement for the MSE solution.
From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (3)
An exact solution to Ya = b can sometimes be found:
- If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by a = Y^{−1} b
- In practice, however, Y will be singular so its inverse Y^{−1} does not exist: Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system, for which an exact solution cannot be found
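For the over-determined case, the standard remedy is the least-squares solution via the pseudoinverse; a short sketch (`np.linalg.lstsq` computes the minimum-norm least-squares solution):

```python
import numpy as np

# Over-determined system Ya = b: more examples (rows) than unknowns
rng = np.random.default_rng(0)
Y = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])  # augmented examples
b = np.ones(8)                                             # target margins

# a minimizes ||Ya - b||^2, i.e., a = pinv(Y) @ b
a, residuals, rank, sv = np.linalg.lstsq(Y, b, rcond=None)
print(a)
```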