LECTURE 17: Linear Discriminant Functions


- Perceptron learning
- Minimum squared error (MSE) solution
- Least-mean-squares (LMS) rule
- Ho-Kashyap procedure

Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University

Linear Discriminant Functions (1)

The objective of this lecture is to present methods for learning linear discriminant functions of the form

    g(x) = w^T x + w_0,   with g(x) > 0 ⇒ x ∈ ω₁ and g(x) < 0 ⇒ x ∈ ω₂

where w is the weight vector and w_0 is the threshold weight or bias (not to be confused with that of the bias-variance dilemma).

[Figure: the hyperplane w^T x + w_0 = 0 in feature space, with the half-space w^T x + w_0 > 0 on one side, w^T x + w_0 < 0 on the other, and d the distance from a point x to the hyperplane]

Linear Discriminant Functions (2)

Similar discriminant functions were derived in Lecture 5 as a special case of the quadratic classifier. In this lecture, the discriminant functions will be derived in a non-parametric fashion; that is, no assumptions will be made about the underlying densities.

For convenience, we will focus on the binary classification problem. Extension to the multicategory case can easily be achieved by
- Using ω_i / not-ω_i dichotomies
- Using ω_i / ω_j dichotomies

Gradient descent

Gradient descent is a general method for function minimization. Recall that the minimum of a function J(x) is defined by the zeros of the gradient:

    ∇_x J(x) = 0  ⇒  x* = argmin_x J(x)

- Only in very special cases does this minimization problem have a closed-form solution
- In some other cases, a closed-form solution may exist but is numerically ill-posed or impractical (e.g., memory requirements)

Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent:
1. Start with an arbitrary solution x(0)
2. Compute the gradient ∇_x J(x(k))
3. Move in the direction of steepest descent: x(k+1) = x(k) − η ∇_x J(x(k))
4. Go to 2 (until convergence)
where η is a learning rate.

[Figure: contour plot of J(x) showing an initial guess descending into a local minimum, with the global minimum elsewhere]
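The four steps above can be written in a few lines of Python. This is a minimal sketch, not part of the original slides; the quadratic J(x) used in the example is a hypothetical stand-in for any differentiable objective, and a fixed iteration count stands in for a proper convergence test:

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, n_iter=100):
    """Minimize J by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)   # 1. start with an arbitrary solution x(0)
    for _ in range(n_iter):
        g = grad(x)                   # 2. compute the gradient at x(k)
        x = x - eta * g               # 3. x(k+1) = x(k) - eta * grad J(x(k))
    return x                          # 4. stop (here: after a fixed number of steps)

# Example: J(x) = ||x - c||^2 has gradient 2(x - c) and minimum at c
c = np.array([3.0, -1.0])
x_min = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0])
```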

Perceptron learning (1)

Let's now consider the problem of learning a binary classification problem with a linear discriminant function.
- As usual, assume we have a dataset X = {x^(1), x^(2), ..., x^(N)} containing examples from the two classes
- For convenience, we will absorb the intercept w_0 by augmenting the feature vector x with an additional constant dimension:

    w^T x + w_0 = [w_0  w^T] [1; x] = a^T y

Keep in mind that our objective is to find a vector a such that

    g(x) = a^T y,  with a^T y > 0 ⇒ x ∈ ω₁ and a^T y < 0 ⇒ x ∈ ω₂

To simplify the derivation, we will normalize the training set by replacing all examples from class ω₂ by their negatives:

    y ← −y  ∀ y ∈ ω₂

This allows us to ignore class labels and look for a weight vector such that a^T y > 0 ∀ y.

From [Duda, Hart and Stork, 2001]
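The two preprocessing steps above (augmenting each example with a constant dimension and negating the ω₂ examples) can be sketched as follows. The function and array names are illustrative, not from the slides:

```python
import numpy as np

def make_normalized(X1, X2):
    """Augment each example with a leading 1 and negate the class-2 examples,
    so that a correct weight vector a satisfies a @ y > 0 for every row y."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    Y1 = np.hstack([np.ones((len(X1), 1)), X1])    # y = [1, x]  for x in omega_1
    Y2 = -np.hstack([np.ones((len(X2), 1)), X2])   # y = -[1, x] for x in omega_2
    return np.vstack([Y1, Y2])

# One example from each class (points borrowed from the numerical-example slide)
Y = make_normalized([[7.0, 2.0]], [[2.0, 2.0]])
```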

Perceptron learning (2)

To find this solution we must first define an objective function J(a). A good choice is what is known as the Perceptron criterion function

    J_P(a) = Σ_{y ∈ Y_M} (−a^T y)

where Y_M is the set of examples misclassified by a. Note that J_P(a) is non-negative, since a^T y < 0 for all misclassified samples.

To find the minimum of this function, we use gradient descent. The gradient is defined by

    ∇_a J_P(a) = Σ_{y ∈ Y_M} (−y)

and the gradient descent update rule becomes

    a(k+1) = a(k) + η Σ_{y ∈ Y_M} y

This is known as the perceptron batch update rule.

The weight vector may also be updated in an on-line fashion, that is, after the presentation of each individual example:

    a(k+1) = a(k) + η y^(i)    [perceptron rule]

where y^(i) is an example that has been misclassified by a(k).
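A minimal sketch of the batch update rule in NumPy (the starting point a = 0, the stopping criterion, and the toy dataset are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron_batch(Y, eta=0.1, max_iter=1000):
    """Batch perceptron rule: a(k+1) = a(k) + eta * sum of misclassified y."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]           # Y_M: normalized samples with a'y <= 0
        if len(mis) == 0:             # every sample satisfies a'y > 0: done
            return a
        a = a + eta * mis.sum(axis=0)
    return a                          # non-separable data: gave up after max_iter

# Toy separable problem (rows are already augmented and normalized)
Y = np.array([[1.0, 2.0], [1.0, 0.5], [-1.0, 1.0]])
a = perceptron_batch(Y)
```

Replacing the sum over Y_M with a single misclassified sample gives the on-line rule.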

Perceptron learning (3)

If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution.
- Some versions of the perceptron rule use a variable learning rate η(k). In this case, convergence is guaranteed only under certain conditions (for details refer to [Duda, Hart and Stork, 2001])

However, if the two classes are not linearly separable, the perceptron rule will not converge.
- Since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections in the perceptron rule will never cease
- One ad hoc solution to this problem is to enforce convergence by using variable learning rates η(k) that approach zero as k approaches infinity

Minimum Squared Error solution (1)

The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule.
- The perceptron rule seeks a weight vector a that satisfies the inequality a^T y^(i) > 0; it only considers the misclassified samples, since these are the only ones that violate the above inequality
- Instead, the MSE criterion looks for a solution to the equality a^T y^(i) = b^(i), where the b^(i) are some pre-specified target values (e.g., class labels). As a result, the MSE solution uses ALL of the samples in the training set

The system of equations solved by MSE is Ya = b:

    | y_0^(1)  y_1^(1)  ...  y_D^(1) |   | a_0 |   | b^(1) |
    | y_0^(2)  y_1^(2)  ...  y_D^(2) | · | a_1 | = | b^(2) |
    |   ...      ...    ...    ...   |   | ... |   |  ...  |
    | y_0^(N)  y_1^(N)  ...  y_D^(N) |   | a_D |   | b^(N) |

where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding target value.

For consistency, we will continue assuming that examples from class ω₂ have been replaced by their negative vectors, although this is not a requirement for the MSE solution.

From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (2)

An exact solution to Ya = b can sometimes be found.
- If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by a = Y^{-1} b
- In practice, however, Y will be singular, so its inverse Y^{-1} does not exist: Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found

The solution in this case is to find a weight vector that minimizes some function of the error between the model (a^T y) and the desired output (b).
- In particular, MSE seeks to minimize the sum of the squares of these errors:

    J_MSE(a) = Σ_{i=1}^{N} (a^T y^(i) − b^(i))² = ‖Ya − b‖²

which, as usual, can be found by setting its gradient to zero.

The pseudo-inverse solution

The gradient of the objective function is

    ∇_a J_MSE = Σ_{i=1}^{N} 2(a^T y^(i) − b^(i)) y^(i) = 2Y^T (Ya − b)

with zeros defined by Y^T Y a = Y^T b.
- Notice that Y^T Y is now a square matrix!

If Y^T Y is nonsingular, the MSE solution becomes

    a = (Y^T Y)^{-1} Y^T b = Y† b    [pseudo-inverse solution]

where the matrix Y† = (Y^T Y)^{-1} Y^T is known as the pseudo-inverse of Y (Y† Y = I).
- Note that, in general, Y Y† ≠ I.
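The pseudo-inverse solution is a one-liner in NumPy. A small sketch with a hypothetical normalized data matrix (four augmented points, class-2 rows already negated) and unit targets; `np.linalg.pinv` computes the same pseudo-inverse via the SVD, which is the numerically safer route when Y^T Y is close to singular:

```python
import numpy as np

# Hypothetical normalized data matrix Y (rows = augmented samples) and targets b
Y = np.array([[ 1.0,  1.0,  6.0],
              [ 1.0,  7.0,  2.0],
              [-1.0, -2.0, -2.0],
              [-1.0, -2.0, -4.0]])
b = np.ones(len(Y))                        # e.g. target a'y = 1 for every sample

# MSE solution from the normal equations: a = (Y'Y)^-1 Y' b
a = np.linalg.solve(Y.T @ Y, Y.T @ b)

# Equivalent: apply the pseudo-inverse directly (computed via SVD)
a_pinv = np.linalg.pinv(Y) @ b
```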

Ridge-regression solution

If the training data is extremely correlated (the collinearity problem), the matrix Y^T Y becomes near singular. As a result, the smaller eigenvalues (the noise) dominate the computation of the inverse (Y^T Y)^{-1}, which results in numerical problems.

The collinearity problem can be solved through regularization. This is equivalent to adding a small multiple of the identity matrix to the term Y^T Y, which results in

    a = [ (1−ε) Y^T Y + ε (tr(Y^T Y)/D) I ]^{-1} Y^T b    [ridge regression]

where ε (0 < ε < 1) is a regularization parameter that controls the amount of shrinkage toward the identity matrix. This is known as the ridge-regression solution.
- If the features have significantly different variances, the regularization term may be replaced by a diagonal matrix of the feature variances

Selection of the regularization parameter:
- For ε = 0, the ridge-regression solution is equivalent to the pseudo-inverse solution
- For ε = 1, the ridge-regression solution is a constant function that predicts the average classification rate across the entire dataset
- An appropriate value for ε is typically found through cross-validation

From [Gutierrez-Osuna, 2002]
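A sketch of the shrinkage estimator above, assuming the (1−ε)/ε convex-combination form with the trace-scaled identity; the data matrix is the same hypothetical example used for the pseudo-inverse:

```python
import numpy as np

def ridge_solution(Y, b, eps):
    """Shrinkage estimator [(1-eps) Y'Y + eps (tr(Y'Y)/D) I]^-1 Y'b."""
    D = Y.shape[1]
    YtY = Y.T @ Y
    reg = (1 - eps) * YtY + eps * (np.trace(YtY) / D) * np.eye(D)
    return np.linalg.solve(reg, Y.T @ b)

Y = np.array([[ 1.0,  1.0,  6.0],
              [ 1.0,  7.0,  2.0],
              [-1.0, -2.0, -2.0],
              [-1.0, -2.0, -4.0]])
b = np.ones(len(Y))
a_pinv  = ridge_solution(Y, b, eps=0.0)   # eps = 0 recovers the pseudo-inverse solution
a_ridge = ridge_solution(Y, b, eps=0.1)   # small eps shrinks toward the identity
```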

Least-mean-squares solution

The minimum of the objective function J_MSE(a) = ‖Ya − b‖² can also be found using a gradient descent procedure.
- This avoids the problems that arise when Y^T Y is singular
- In addition, it avoids the need for working with large matrices

Looking at the expression of the gradient, the obvious update rule is

    a(k+1) = a(k) + η(k) Y^T (b − Ya(k))

It can be shown that if η(k) = η(1)/k, where η(1) is any positive constant, this rule generates a sequence of vectors that converges to a solution of Y^T (Ya − b) = 0.

The storage requirements of this algorithm can be reduced by considering each sample sequentially:

    a(k+1) = a(k) + η(k) (b^(i) − y^(i)T a(k)) y^(i)    [LMS rule]

This is known as the Widrow-Hoff, least-mean-squares (LMS), or delta rule [Mitchell, 1997].

From [Duda, Hart and Stork, 2001]
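A sketch of the sequential LMS rule with the decaying rate η(k) = η(1)/k; the toy data, starting point a = 0, and epoch count are illustrative choices:

```python
import numpy as np

def lms(Y, b, eta1=0.2, n_epochs=200):
    """Sequential LMS/Widrow-Hoff rule with decaying rate eta(k) = eta(1)/k."""
    a = np.zeros(Y.shape[1])
    k = 1
    for _ in range(n_epochs):
        for y, bi in zip(Y, b):
            # a(k+1) = a(k) + eta(k) * (b_i - y'a(k)) * y
            a = a + (eta1 / k) * (bi - y @ a) * y
            k += 1
    return a

Y = np.array([[1.0, 2.0], [1.0, 0.5], [-1.0, 1.0]])
b = np.ones(3)
a = lms(Y, b)
a_exact = np.linalg.pinv(Y) @ b   # the batch MSE solution that LMS approaches
```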

Numerical example

Compute the perceptron and MSE solutions for the following dataset:
    ω₁ = {(1,6), (7,2), (8,9), (9,9)}
    ω₂ = {(2,1), (2,2), (2,4), (7,1)}
Assume η = 0.1 and a(0) = [0.1, 0.1, 0.1], and use the on-line update rule.

SOLUTION
Normalize the dataset (augment each example with a leading 1 and negate the ω₂ examples). Then iterate through the examples and update a(k) on the ones that are misclassified:
1. y^(1): [1 1 6]·[0.1 0.1 0.1]^T > 0 ⇒ no update
2. y^(2): [1 7 2]·[0.1 0.1 0.1]^T > 0 ⇒ no update
...
5. y^(5): [−1 −2 −1]·[0.1 0.1 0.1]^T < 0 ⇒ update: a(1) = [0.1 0.1 0.1] + η·[−1 −2 −1] = [0 −0.1 0]
6. y^(6): [−1 −2 −2]·[0 −0.1 0]^T > 0 ⇒ no update
...
On the second pass, y^(1): [1 1 6]·[0 −0.1 0]^T < 0 ⇒ update: a(2) = [0 −0.1 0] + η·[1 1 6] = [0.1 0 0.6]
Then y^(2): [1 7 2]·[0.1 0 0.6]^T > 0 ⇒ no update
...
In this example, the perceptron rule converges after 75 iterations to a solution a. To convince yourself this is a solution, compute Y·a: you will find that all of the terms are non-negative.

The MSE solution is found in one shot as a = (Y^T Y)^{-1} Y^T b for the choice of targets b = [1 1 1 1 1 1 1 1]^T. As you can see in the figure, the MSE solution misclassifies one of the samples.

[Figure: scatter plot of the two classes with the perceptron and MSE decision boundaries]

Summary: Perceptron vs. MSE procedures

Perceptron rule
- The perceptron rule always finds a solution if the classes are linearly separable, but does not converge if the classes are non-separable

MSE criterion
- The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even if the classes are linearly separable. Notice that MSE tries to minimize the sum of the squares of the distances of the training data to the separating hyperplane, as opposed to finding this hyperplane

[Figure: a separable dataset on which the perceptron boundary separates the two classes while the LMS/MSE boundary misclassifies some samples]

The Ho-Kashyap procedure (1)

The main limitation of the MSE criterion is the lack of guarantees that a separating hyperplane will be found in the linearly separable case.
- All we can say about the MSE rule is that it minimizes ‖Ya − b‖²
- Whether MSE finds a separating hyperplane or not depends on how properly the target outputs b^(i) are selected

Now, if the two classes are linearly separable, there must exist vectors a* and b* such that Ya* = b* > 0.
- If b* were known, one could simply use the MSE solution (a = Y†b*) to compute the separating hyperplane
- However, since b* is unknown, one must then solve for BOTH a and b

This idea gives rise to an alternative training algorithm for linear discriminant functions known as the Ho-Kashyap procedure:
1) Find the target values b through gradient descent
2) Compute the weight vector a from the MSE solution
3) Repeat 1) and 2) until convergence

(Here we also assume that examples from class ω₂ have been replaced by their negatives: y ← −y ∀ y ∈ ω₂)

The Ho-Kashyap procedure (2)

The gradient ∇_b J is defined by

    ∇_b J_MSE(a, b) = −2(Ya − b)

which suggests a possible update rule for b. Now, since b is subject to the constraint b > 0, we are not free to follow whichever direction the gradient may point to.
- The solution is to start with an initial solution b > 0 and refuse to reduce any of its components
- This is accomplished by setting to zero all the positive components of ∇_b J, i.e., following only ½(∇_b J − |∇_b J|), where |·| denotes the absolute value of the components of the argument vector

Once b is updated, the MSE solution a = Y†b directly provides the zeros of ∇_a J. The resulting iterative procedure is

    b(k+1) = b(k) − η ½ (∇_b J − |∇_b J|)
    a(k+1) = Y† b(k+1)    [Ho-Kashyap procedure]
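The alternation above can be sketched as follows. Writing e = Ya − b, the gradient is ∇_b J = −2e, so the clipped step −η·½(∇_b J − |∇_b J|) equals +η(e + |e|): b only ever grows, and only where e > 0. This is a sketch under those identities, with an illustrative separable toy problem; names, η, and the iteration count are our choices:

```python
import numpy as np

def ho_kashyap(Y, eta=0.9, n_iter=500):
    """Ho-Kashyap: alternate clipped gradient steps on b (kept positive)
    with the MSE/pseudo-inverse solution for a."""
    Ydag = np.linalg.pinv(Y)            # Y+ = (Y'Y)^-1 Y'
    b = np.ones(len(Y))                 # initial b > 0
    a = Ydag @ b
    for _ in range(n_iter):
        e = Y @ a - b                   # error vector; grad_b J = -2e
        b = b + eta * (e + np.abs(e))   # increase b only where e > 0 (b never shrinks)
        a = Ydag @ b                    # re-solve the MSE problem for the new b
    return a, b

# Linearly separable toy problem (rows augmented, class-2 rows negated)
Y = np.array([[1.0, 2.0], [1.0, 0.5], [-1.0, 1.0]])
a, b = ho_kashyap(Y)
```

In the separable case the procedure drives e⁺ to zero, so Ya = b > 0 and a separating hyperplane is found.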


More information

Let l be an index for latent variables (l=1,2,3,4). Consider the latent variable z. vector of observed covariates (excluding a constant), α

Let l be an index for latent variables (l=1,2,3,4). Consider the latent variable z. vector of observed covariates (excluding a constant), α Olie Supplemet to MaaS i Car-Domiated Cities: Modeli the adoptio, frequecy, ad characteristics of ride-haili trips i Dallas, TX Patrícia S. Lavieri ad Chadra R. Bhat (correspodi author) A Overview of the

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

We will conclude the chapter with the study a few methods and techniques which are useful

We will conclude the chapter with the study a few methods and techniques which are useful Chapter : Coordiate geometry: I this chapter we will lear about the mai priciples of graphig i a dimesioal (D) Cartesia system of coordiates. We will focus o drawig lies ad the characteristics of the graphs

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

First, note that the LS residuals are orthogonal to the regressors. X Xb X y = 0 ( normal equations ; (k 1) ) So,

First, note that the LS residuals are orthogonal to the regressors. X Xb X y = 0 ( normal equations ; (k 1) ) So, 0 2. OLS Part II The OLS residuals are orthogoal to the regressors. If the model icludes a itercept, the orthogoality of the residuals ad regressors gives rise to three results, which have limited practical

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

Introduction to Signals and Systems, Part V: Lecture Summary

Introduction to Signals and Systems, Part V: Lecture Summary EEL33: Discrete-Time Sigals ad Systems Itroductio to Sigals ad Systems, Part V: Lecture Summary Itroductio to Sigals ad Systems, Part V: Lecture Summary So far we have oly looked at examples of o-recursive

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

Math 25 Solutions to practice problems

Math 25 Solutions to practice problems Math 5: Advaced Calculus UC Davis, Sprig 0 Math 5 Solutios to practice problems Questio For = 0,,, 3,... ad 0 k defie umbers C k C k =! k!( k)! (for k = 0 ad k = we defie C 0 = C = ). by = ( )... ( k +

More information

IIT JAM Mathematical Statistics (MS) 2006 SECTION A

IIT JAM Mathematical Statistics (MS) 2006 SECTION A IIT JAM Mathematical Statistics (MS) 6 SECTION A. If a > for ad lim a / L >, the which of the followig series is ot coverget? (a) (b) (c) (d) (d) = = a = a = a a + / a lim a a / + = lim a / a / + = lim

More information

Chapter 7. Support Vector Machine

Chapter 7. Support Vector Machine Chapter 7 Support Vector Machie able of Cotet Margi ad support vectors SVM formulatio Slack variables ad hige loss SVM for multiple class SVM ith Kerels Relevace Vector Machie Support Vector Machie (SVM)

More information

Lecture 4. We also define the set of possible values for the random walk as the set of all x R d such that P(S n = x) > 0 for some n.

Lecture 4. We also define the set of possible values for the random walk as the set of all x R d such that P(S n = x) > 0 for some n. Radom Walks ad Browia Motio Tel Aviv Uiversity Sprig 20 Lecture date: Mar 2, 20 Lecture 4 Istructor: Ro Peled Scribe: Lira Rotem This lecture deals primarily with recurrece for geeral radom walks. We preset

More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

Real Variables II Homework Set #5

Real Variables II Homework Set #5 Real Variables II Homework Set #5 Name: Due Friday /0 by 4pm (at GOS-4) Istructios: () Attach this page to the frot of your homework assigmet you tur i (or write each problem before your solutio). () Please

More information

ENGI 9420 Engineering Analysis Assignment 3 Solutions

ENGI 9420 Engineering Analysis Assignment 3 Solutions ENGI 9 Egieerig Aalysis Assigmet Solutios Fall [Series solutio of ODEs, matri algebra; umerical methods; Chapters, ad ]. Fid a power series solutio about =, as far as the term i 7, to the ordiary differetial

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Chapter Vectors

Chapter Vectors Chapter 4. Vectors fter readig this chapter you should be able to:. defie a vector. add ad subtract vectors. fid liear combiatios of vectors ad their relatioship to a set of equatios 4. explai what it

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Naïve Bayes. Naïve Bayes

Naïve Bayes. Naïve Bayes Statistical Data Miig ad Machie Learig Hilary Term 206 Dio Sejdiovic Departmet of Statistics Oxford Slides ad other materials available at: http://www.stats.ox.ac.uk/~sejdiov/sdmml : aother plug-i classifier

More information

Similarity Solutions to Unsteady Pseudoplastic. Flow Near a Moving Wall

Similarity Solutions to Unsteady Pseudoplastic. Flow Near a Moving Wall Iteratioal Mathematical Forum, Vol. 9, 04, o. 3, 465-475 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/0.988/imf.04.48 Similarity Solutios to Usteady Pseudoplastic Flow Near a Movig Wall W. Robi Egieerig

More information

CS537. Numerical Analysis and Computing

CS537. Numerical Analysis and Computing CS57 Numerical Aalysis ad Computig Lecture Locatig Roots o Equatios Proessor Ju Zhag Departmet o Computer Sciece Uiversity o Ketucky Leigto KY 456-6 Jauary 9 9 What is the Root May physical system ca be

More information

PAijpam.eu ON DERIVATION OF RATIONAL SOLUTIONS OF BABBAGE S FUNCTIONAL EQUATION

PAijpam.eu ON DERIVATION OF RATIONAL SOLUTIONS OF BABBAGE S FUNCTIONAL EQUATION Iteratioal Joural of Pure ad Applied Mathematics Volume 94 No. 204, 9-20 ISSN: 3-8080 (prited versio); ISSN: 34-3395 (o-lie versio) url: http://www.ijpam.eu doi: http://dx.doi.org/0.2732/ijpam.v94i.2 PAijpam.eu

More information

TMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences.

TMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences. Norwegia Uiversity of Sciece ad Techology Departmet of Mathematical Scieces Corrected 3 May ad 4 Jue Solutios TMA445 Statistics Saturday 6 May 9: 3: Problem Sow desity a The probability is.9.5 6x x dx

More information

(b) What is the probability that a particle reaches the upper boundary n before the lower boundary m?

(b) What is the probability that a particle reaches the upper boundary n before the lower boundary m? MATH 529 The Boudary Problem The drukard s walk (or boudary problem) is oe of the most famous problems i the theory of radom walks. Oe versio of the problem is described as follows: Suppose a particle

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Introduction to Artificial Intelligence CAP 4601 Summer 2013 Midterm Exam

Introduction to Artificial Intelligence CAP 4601 Summer 2013 Midterm Exam Itroductio to Artificial Itelligece CAP 601 Summer 013 Midterm Exam 1. Termiology (7 Poits). Give the followig task eviromets, eter their properties/characteristics. The properties/characteristics of the

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

APPENDIX A SMO ALGORITHM

APPENDIX A SMO ALGORITHM AENDIX A SMO ALGORITHM Sequetial Miimal Optimizatio SMO) is a simple algorithm that ca quickly solve the SVM Q problem without ay extra matrix storage ad without usig time-cosumig umerical Q optimizatio

More information

11 Correlation and Regression

11 Correlation and Regression 11 Correlatio Regressio 11.1 Multivariate Data Ofte we look at data where several variables are recorded for the same idividuals or samplig uits. For example, at a coastal weather statio, we might record

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

Admin REGULARIZATION. Schedule. Midterm 9/29/16. Assignment 5. Midterm next week, due Friday (more on this in 1 min)

Admin REGULARIZATION. Schedule. Midterm 9/29/16. Assignment 5. Midterm next week, due Friday (more on this in 1 min) Admi Assigmet 5! Starter REGULARIZATION David Kauchak CS 158 Fall 2016 Schedule Midterm ext week, due Friday (more o this i 1 mi Assigmet 6 due Friday before fall break Midterm Dowload from course web

More information

1 Covariance Estimation

1 Covariance Estimation Eco 75 Lecture 5 Covariace Estimatio ad Optimal Weightig Matrices I this lecture, we cosider estimatio of the asymptotic covariace matrix B B of the extremum estimator b : Covariace Estimatio Lemma 4.

More information