Support Vector Machines and Flexible Discriminants
12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalize Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis, which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines; penalized discriminant analysis, for problems such as signal and image classification where a large number of features are highly correlated; and mixture discriminant analysis for irregularly shaped classes.

12.2 The Support Vector Classifier

In Chapter 4 we discussed a technique for constructing an optimal separating hyperplane between two perfectly separated classes. We review this and generalize to the nonseparable case, where the classes may not be separable by a linear boundary.
FIGURE 12.1. Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width 2M = 2/||β||. The right panel shows the nonseparable (overlap) case. The points labeled ξ*_j are on the wrong side of their margin by an amount ξ*_j = M ξ_j; points on the correct side have ξ*_j = 0. The margin is maximized subject to a total budget Σ ξ_i ≤ constant. Hence Σ ξ*_j is the total distance of points on the wrong side of their margin.

Our training data consists of N pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), with x_i ∈ R^p and y_i ∈ {−1, 1}. Define a hyperplane by

    {x : f(x) = x^T β + β_0 = 0},    (12.1)

where β is a unit vector: ||β|| = 1. A classification rule induced by f(x) is

    G(x) = sign[x^T β + β_0].    (12.2)

The geometry of hyperplanes is reviewed in Section 4.5, where we show that f(x) in (12.1) gives the signed distance from a point x to the hyperplane f(x) = x^T β + β_0 = 0. Since the classes are separable, we can find a function f(x) = x^T β + β_0 with y_i f(x_i) > 0 for all i. Hence we are able to find the hyperplane that creates the biggest margin between the training points for class 1 and −1 (see Figure 12.1). The optimization problem

    max_{β, β_0, ||β||=1} M  subject to  y_i (x_i^T β + β_0) ≥ M, i = 1, ..., N,    (12.3)

captures this concept. The band in the figure is M units away from the hyperplane on either side, and hence 2M units wide. It is called the margin. We showed that this problem can be more conveniently rephrased as

    min_{β, β_0} ||β||  subject to  y_i (x_i^T β + β_0) ≥ 1, i = 1, ..., N,    (12.4)
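As a small illustration (not from the text), the signed-distance property of (12.1)–(12.2) can be checked numerically: with ||β|| = 1, f(x) is the distance from x to the hyperplane, and the induced classifier is its sign. The particular β, β_0, and test point below are arbitrary choices for the sketch.

```python
import numpy as np

# A hyperplane {x : f(x) = x^T beta + beta_0 = 0} with ||beta|| = 1,
# so f(x) is the signed distance from x to the hyperplane (Section 4.5).
beta = np.array([3.0, 4.0])
beta /= np.linalg.norm(beta)   # enforce the unit-norm convention of (12.1)
beta_0 = -1.0

def f(x):
    # signed distance from x to the hyperplane
    return x @ beta + beta_0

def G(x):
    # classification rule (12.2) induced by f
    return np.sign(f(x))

x = np.array([2.0, 1.0])
print(f(x), G(x))   # distance 1.0, classified as +1
```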
where we have dropped the norm constraint on β. Note that M = 1/||β||. Expression (12.4) is the usual way of writing the support vector criterion for separated data. This is a convex optimization problem (quadratic criterion, linear inequality constraints), and the solution is characterized in Section 12.2.1.

Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize M, but allow for some points to be on the wrong side of the margin. Define the slack variables ξ = (ξ_1, ξ_2, ..., ξ_N). There are two natural ways to modify the constraint in (12.3):

    y_i (x_i^T β + β_0) ≥ M − ξ_i,    (12.5)

or

    y_i (x_i^T β + β_0) ≥ M(1 − ξ_i),    (12.6)

for all i, with ξ_i ≥ 0 and Σ_{i=1}^N ξ_i ≤ constant. The two choices lead to different solutions. The first choice seems more natural, since it measures overlap in actual distance from the margin; the second choice measures the overlap in relative distance, which changes with the width of the margin M. However, the first choice results in a nonconvex optimization problem, while the second is convex; thus (12.6) leads to the "standard" support vector classifier, which we use from here on.

Here is the idea of the formulation. The value ξ_i in the constraint y_i (x_i^T β + β_0) ≥ M(1 − ξ_i) is the proportional amount by which the prediction f(x_i) = x_i^T β + β_0 is on the wrong side of its margin. Hence by bounding the sum Σ ξ_i, we bound the total proportional amount by which predictions fall on the wrong side of their margin. Misclassifications occur when ξ_i > 1, so bounding Σ ξ_i at a value K, say, bounds the total number of training misclassifications at K.

As in (4.48) in Section 4.5.2, we can drop the norm constraint on β, define M = 1/||β||, and write (12.4) in the equivalent form

    min ||β||  subject to  y_i (x_i^T β + β_0) ≥ 1 − ξ_i for all i,  ξ_i ≥ 0,  Σ ξ_i ≤ constant.    (12.7)

This is the usual way the support vector classifier is defined for the nonseparable case. However, we find confusing the presence of the fixed scale "1" in the constraint y_i (x_i^T β + β_0) ≥ 1 − ξ_i, and prefer to start with (12.6). The right panel of Figure 12.1 illustrates this overlapping case.
By the nature of the criterion (12.7), we see that points well inside their class boundary do not play a big role in shaping the boundary. This seems like an attractive property, and one that differentiates it from linear discriminant analysis (Section 4.3). In LDA, the decision boundary is determined by the covariance of the class distributions and the positions of the class centroids. We will see in Section 12.3.2 that logistic regression is more similar to the support vector classifier in this regard.
12.2.1 Computing the Support Vector Classifier

The problem (12.7) is quadratic with linear inequality constraints, hence it is a convex optimization problem. We describe a quadratic programming solution using Lagrange multipliers. Computationally it is convenient to re-express (12.7) in the equivalent form

    min_{β, β_0} (1/2) ||β||^2 + C Σ_{i=1}^N ξ_i
    subject to ξ_i ≥ 0, y_i (x_i^T β + β_0) ≥ 1 − ξ_i for all i,    (12.8)

where the "cost" parameter C replaces the constant in (12.7); the separable case corresponds to C = ∞.

The Lagrange (primal) function is

    L_P = (1/2) ||β||^2 + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [y_i (x_i^T β + β_0) − (1 − ξ_i)] − Σ_{i=1}^N μ_i ξ_i,    (12.9)

which we minimize w.r.t. β, β_0 and ξ_i. Setting the respective derivatives to zero, we get

    β = Σ_{i=1}^N α_i y_i x_i,    (12.10)
    0 = Σ_{i=1}^N α_i y_i,    (12.11)
    α_i = C − μ_i, for all i,    (12.12)

as well as the positivity constraints α_i, μ_i, ξ_i ≥ 0 for all i. By substituting (12.10)–(12.12) into (12.9), we obtain the Lagrangian (Wolfe) dual objective function

    L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{i'=1}^N α_i α_{i'} y_i y_{i'} x_i^T x_{i'},    (12.13)

which gives a lower bound on the objective function (12.8) for any feasible point. We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^N α_i y_i = 0. In addition to (12.10)–(12.12), the Karush–Kuhn–Tucker conditions include the constraints

    α_i [y_i (x_i^T β + β_0) − (1 − ξ_i)] = 0,    (12.14)
    μ_i ξ_i = 0,    (12.15)
    y_i (x_i^T β + β_0) − (1 − ξ_i) ≥ 0,    (12.16)

for i = 1, ..., N. Together these equations (12.10)–(12.16) uniquely characterize the solution to the primal and dual problem.
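As an illustration (not from the text), scikit-learn's `SVC` solves exactly this dual problem, and exposes the quantities above: `dual_coef_` holds the nonzero products α_i y_i and `support_vectors_` the corresponding x_i, so the stationarity condition (12.10), β = Σ α_i y_i x_i, can be verified directly. The simulated two-Gaussian data is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian classes, labels in {-1, +1}
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# (12.10): beta = sum_i alpha_i y_i x_i, with alpha_i nonzero only for
# support vectors; sklearn stores alpha_i * y_i for those points in dual_coef_.
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_dual, clf.coef_))   # the two agree
```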
From (12.10) we see that the solution for β has the form

    β̂ = Σ_{i=1}^N α̂_i y_i x_i,    (12.17)

with nonzero coefficients α̂_i only for those observations i for which the constraints in (12.16) are exactly met (due to (12.14)). These observations are called the support vectors, since β̂ is represented in terms of them alone. Among these support points, some will lie on the edge of the margin (ξ̂_i = 0), and hence from (12.15) and (12.12) will be characterized by 0 < α̂_i < C; the remainder (ξ̂_i > 0) have α̂_i = C. From (12.14) we can see that any of these margin points (0 < α̂_i, ξ̂_i = 0) can be used to solve for β_0, and we typically use an average of all the solutions for numerical stability.

Maximizing the dual (12.13) is a simpler convex quadratic programming problem than the primal (12.9), and can be solved with standard techniques (Murray et al., 1981, for example).

Given the solutions β̂_0 and β̂, the decision function can be written as

    Ĝ(x) = sign[f̂(x)] = sign[x^T β̂ + β̂_0].    (12.18)

The tuning parameter of this procedure is the cost parameter C.

12.2.2 Mixture Example (Continued)

Figure 12.2 shows the support vector boundary for the mixture example of Figure 2.5 on page 21, with two overlapping classes, for two different values of the cost parameter C. The classifiers are rather similar in their performance. Points on the wrong side of the boundary are support vectors. In addition, points on the correct side of the boundary but close to it (in the margin) are also support vectors. The margin is larger for C = 0.01 than it is for C = 10,000. Hence larger values of C focus attention more on (correctly classified) points near the decision boundary, while smaller values involve data further away. Either way, misclassified points are given weight, no matter how far away. In this example the procedure is not very sensitive to choices of C, because of the rigidity of a linear boundary.

The optimal value for C can be estimated by cross-validation, as discussed in Chapter 7.
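The effect of C described above can be sketched numerically (an illustration, not from the text): since M = 1/||β̂||, a smaller C typically yields a smaller ||β̂||, hence a wider margin, and more points fall inside it and become support vectors. The simulated data and the two C values (chosen to echo Figure 12.2) are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

results = {}
for C in (0.01, 10_000):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # M = 1 / ||beta_hat||
    results[C] = (margin, len(clf.support_))
    print(C, results[C])   # (margin width, number of support points)
```

On data like this, the C = 0.01 fit has the wider margin and the larger set of support points, matching the discussion of Figure 12.2.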
Interestingly, the leave-one-out cross-validation error can be bounded above by the proportion of support points in the data. The reason is that leaving out an observation that is not a support vector will not change the solution. Hence these observations, being classified correctly by the original boundary, will be classified correctly in the cross-validation process. However, this bound tends to be too high, and not generally useful for choosing C (62% and 85%, respectively, in our examples).
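The leave-one-out bound can be checked empirically (an illustration, not from the text): compute the actual leave-one-out error and compare it with the fraction of support points in the full fit. The simulated data is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0)
# actual leave-one-out misclassification rate
loo_error = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
# proportion of support points in the full-data fit: the upper bound
sv_fraction = len(clf.fit(X, y).support_) / len(y)
print(loo_error, sv_fraction)   # the bound loo_error <= sv_fraction holds
```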
FIGURE 12.2. The linear support vector boundary for the mixture data example with two overlapping classes, for two different values of C. The broken lines indicate the margins, where f(x) = ±1. The support points (α_i > 0) are all the points on the wrong side of their margin. The black solid dots are those support points falling exactly on the margin (ξ_i = 0, α_i > 0). In the upper panel 62% of the observations are support points, while in the lower panel 85% are. The broken purple curve in the background is the Bayes decision boundary. (For the C = 0.01 panel: training error 0.26, test error 0.30, Bayes error 0.21.)
12.3 Support Vector Machines and Kernels

The support vector classifier described so far finds linear boundaries in the input feature space. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions such as polynomials or splines (Chapter 5). Generally linear boundaries in the enlarged space achieve better training-class separation, and translate to nonlinear boundaries in the original space. Once the basis functions h_m(x), m = 1, ..., M are selected, the procedure is the same as before. We fit the SV classifier using input features h(x_i) = (h_1(x_i), h_2(x_i), ..., h_M(x_i)), i = 1, ..., N, and produce the (nonlinear) function f̂(x) = h(x)^T β̂ + β̂_0. The classifier is Ĝ(x) = sign(f̂(x)) as before.

The support vector machine classifier is an extension of this idea, where the dimension of the enlarged space is allowed to get very large, infinite in some cases. It might seem that the computations would become prohibitive. It would also seem that with sufficient basis functions, the data would be separable, and overfitting would occur. We first show how the SVM technology deals with these issues. We then see that in fact the SVM classifier is solving a function-fitting problem using a particular criterion and form of regularization, and is part of a much bigger class of problems that includes the smoothing splines of Chapter 5. The reader may wish to consult Section 5.8, which provides background material and overlaps somewhat with the next two sections.

12.3.1 Computing the SVM for Classification

We can represent the optimization problem (12.9) and its solution in a special way that only involves the input features via inner products. We do this directly for the transformed feature vectors h(x_i). We then see that for particular choices of h, these inner products can be computed very cheaply.

The Lagrange dual function (12.13) has the form

    L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{i'=1}^N α_i α_{i'} y_i y_{i'} ⟨h(x_i), h(x_{i'})⟩.
    (12.19)

From (12.10) we see that the solution function f(x) can be written

    f(x) = h(x)^T β + β_0 = Σ_{i=1}^N α_i y_i ⟨h(x), h(x_i)⟩ + β_0.    (12.20)

As before, given α_i, β_0 can be determined by solving y_i f(x_i) = 1 in (12.20) for any (or all) x_i for which 0 < α_i < C.
So both (12.19) and (12.20) involve h(x) only through inner products. In fact, we need not specify the transformation h(x) at all, but require only knowledge of the kernel function

    K(x, x') = ⟨h(x), h(x')⟩    (12.21)

that computes inner products in the transformed space. K should be a symmetric positive (semi-) definite function; see Section 5.8.1.

Three popular choices for K in the SVM literature are

    dth-degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d,
    Radial basis: K(x, x') = exp(−γ ||x − x'||^2),
    Neural network: K(x, x') = tanh(κ_1 ⟨x, x'⟩ + κ_2).    (12.22)

Consider for example a feature space with two inputs X_1 and X_2, and a polynomial kernel of degree 2. Then

    K(X, X') = (1 + ⟨X, X'⟩)^2
             = (1 + X_1 X'_1 + X_2 X'_2)^2
             = 1 + 2 X_1 X'_1 + 2 X_2 X'_2 + (X_1 X'_1)^2 + (X_2 X'_2)^2 + 2 X_1 X'_1 X_2 X'_2.    (12.23)

Then M = 6, and if we choose h_1(X) = 1, h_2(X) = √2 X_1, h_3(X) = √2 X_2, h_4(X) = X_1^2, h_5(X) = X_2^2, and h_6(X) = √2 X_1 X_2, then K(X, X') = ⟨h(X), h(X')⟩. From (12.20) we see that the solution can be written

    f̂(x) = Σ_{i=1}^N α̂_i y_i K(x, x_i) + β̂_0.    (12.24)

The role of the parameter C is clearer in an enlarged feature space, since perfect separation is often achievable there. A large value of C will discourage any positive ξ_i, and lead to an overfit wiggly boundary in the original feature space; a small value of C will encourage a small value of ||β||, which in turn causes f(x), and hence the boundary, to be smoother. Figure 12.3 shows two nonlinear support vector machines applied to the mixture example of Chapter 2. The regularization parameter was chosen in both cases to achieve good test error. The radial basis kernel produces a boundary quite similar to the Bayes optimal boundary for this example; compare Figure 2.5.

In the early literature on support vectors, there were claims that the kernel property of the support vector machine is unique to it and allows one to finesse the curse of dimensionality. Neither of these claims is true, and we go into both of these issues in the next three subsections.
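The degree-2 expansion in (12.23) can be verified numerically (an illustration, not from the text): the kernel evaluation (1 + ⟨x, x'⟩)² agrees with the explicit inner product of the six basis functions h_1, ..., h_6, so the transformation h never needs to be formed when fitting. The two test points are arbitrary choices for the sketch.

```python
import numpy as np

def K(x, xp):
    # degree-2 polynomial kernel from (12.22)
    return (1.0 + x @ xp) ** 2

def h(x):
    # explicit basis expansion from (12.23), M = 6
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(K(x, xp), h(x) @ h(xp)))   # the inner products agree
```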
FIGURE 12.3. Two nonlinear SVMs for the mixture data. The upper plot ("SVM – Degree-4 Polynomial in Feature Space") uses a 4th-degree polynomial kernel, the lower ("SVM – Radial Kernel in Feature Space") a radial basis kernel (with γ = 1). In each case C was tuned to approximately achieve the best test error performance, and C = 1 worked well in both cases. The radial basis kernel performs the best (close to Bayes optimal), as might be expected given the data arise from mixtures of Gaussians. The broken purple curve in the background is the Bayes decision boundary.
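The kernel representation (12.24) can also be checked against a fitted radial-kernel SVM (an illustration, not from the text): summing α̂_i y_i K(x, x_i) over the support vectors and adding β̂_0 reproduces the fitted decision function exactly. The simulated data and test point are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(2.0, 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def f_hat(x):
    # (12.24): f(x) = sum_i alpha_i y_i K(x, x_i) + beta_0, over support
    # vectors only; sklearn stores alpha_i * y_i in dual_coef_.
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([1.0, 1.0])
print(np.isclose(f_hat(x_new), clf.decision_function(x_new.reshape(1, -1))[0]))
```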
FIGURE 12.4. The support vector loss function (hinge loss), compared to the negative log-likelihood loss (binomial deviance) for logistic regression, squared-error loss, and a "Huberized" version of the squared hinge loss. All are shown as a function of yf rather than f, because of the symmetry between the y = +1 and y = −1 case. The deviance and Huber have the same asymptotes as the SVM loss, but are rounded in the interior. All are scaled to have the limiting left-tail slope of −1.

12.3.2 The SVM as a Penalization Method

With f(x) = h(x)^T β + β_0, consider the optimization problem

    min_{β_0, β} Σ_{i=1}^N [1 − y_i f(x_i)]_+ + (λ/2) ||β||^2,    (12.25)

where the subscript "+" indicates positive part. This has the form loss + penalty, which is a familiar paradigm in function estimation. It is easy to show (Exercise 12.1) that the solution to (12.25), with λ = 1/C, is the same as that for (12.8).

Examination of the "hinge" loss function L(y, f) = [1 − yf]_+ shows that it is reasonable for two-class classification, when compared to other more traditional loss functions. Figure 12.4 compares it to the log-likelihood loss for logistic regression, as well as squared-error loss and a variant thereof. The (negative) log-likelihood or binomial deviance has similar tails as the SVM loss, giving zero penalty to points well inside their margin, and a
TABLE 12.1. The population minimizers for the different loss functions in Figure 12.4. Logistic regression uses the binomial log-likelihood or deviance. Linear discriminant analysis (Exercise 4.2) uses squared-error loss. The SVM hinge loss estimates the mode of the posterior class probabilities, whereas the others estimate a linear transformation of these probabilities.

    Loss Function                | L[y, f(x)]                                            | Minimizing Function
    Binomial Deviance            | log[1 + e^{−yf(x)}]                                   | f(x) = log [Pr(Y = +1|x) / Pr(Y = −1|x)]
    SVM Hinge Loss               | [1 − yf(x)]_+                                         | f(x) = sign[Pr(Y = +1|x) − 1/2]
    Squared Error                | [y − f(x)]^2 = [1 − yf(x)]^2                          | f(x) = 2 Pr(Y = +1|x) − 1
    Huberized Square Hinge Loss  | −4yf(x) if yf(x) < −1; [1 − yf(x)]_+^2 otherwise      | f(x) = 2 Pr(Y = +1|x) − 1

linear penalty to points on the wrong side and far away. Squared-error, on the other hand, gives a quadratic penalty, and points well inside their own margin have a strong influence on the model as well. The squared hinge loss L(y, f) = [1 − yf]_+^2 is like the quadratic, except it is zero for points inside their margin. It still rises quadratically in the left tail, and will be less robust than hinge or deviance to misclassified observations. Recently Rosset and Zhu (2007) proposed a "Huberized" version of the squared hinge loss, which converts smoothly to a linear loss at yf = −1.

We can characterize these loss functions in terms of what they are estimating at the population level. We consider minimizing E L(Y, f(X)). Table 12.1 summarizes the results. Whereas the hinge loss estimates the classifier G(x) itself, all the others estimate a transformation of the class posterior probabilities. The "Huberized" square hinge loss shares attractive properties of logistic regression (smooth loss function, estimates probabilities), as well as the SVM hinge loss (support points).

Formulation (12.25) casts the SVM as a regularized function estimation problem, where the coefficients of the linear expansion f(x) = β_0 + h(x)^T β are shrunk toward zero (excluding the constant). If h(x) represents a hierarchical basis having some ordered structure (such as ordered in roughness),
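Two rows of Table 12.1 can be checked numerically (an illustration, not from the text): for a point with an assumed posterior p = Pr(Y = +1|x), minimizing E L(Y, f) over a grid of f values recovers sign[p − 1/2] for the hinge loss and 2p − 1 for squared error. The value p = 0.7 and the grid are arbitrary choices for the sketch.

```python
import numpy as np

def hinge(yf):
    # SVM hinge loss [1 - yf]_+
    return np.maximum(0.0, 1.0 - yf)

def squared(yf):
    # squared-error loss [y - f]^2 = [1 - yf]^2
    return (1.0 - yf) ** 2

p = 0.7                         # assumed Pr(Y = +1 | x)
f_grid = np.linspace(-3, 3, 6001)

def expected_loss(loss):
    # E L(Y, f) = p L(f) + (1 - p) L(-f), with L written in terms of yf
    return p * loss(f_grid) + (1 - p) * loss(-f_grid)

f_hinge = f_grid[np.argmin(expected_loss(hinge))]
f_sq = f_grid[np.argmin(expected_loss(squared))]
print(round(f_hinge, 3), round(f_sq, 3))   # sign[p - 1/2] = 1, and 2p - 1 = 0.4
```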
then the uniform shrinkage makes more sense if the rougher elements h_j in the vector h have smaller norm.

All the loss functions in Table 12.1 except squared-error are so-called "margin maximizing loss functions" (Rosset et al., 2004b). This means that if the data are separable, then the limit of β̂_λ in (12.25) as λ → 0 defines the optimal separating hyperplane.¹

12.3.3 Function Estimation and Reproducing Kernels

Here we describe SVMs in terms of function estimation in reproducing kernel Hilbert spaces, where the kernel property abounds. This material is discussed in some detail in Section 5.8. This provides another view of the support vector classifier, and helps to clarify how it works.

Suppose the basis h arises from the (possibly finite) eigen-expansion of a positive definite kernel K,

    K(x, x') = Σ_{m=1}^∞ φ_m(x) φ_m(x') δ_m,    (12.26)

and h_m(x) = √δ_m φ_m(x). Then with θ_m = √δ_m β_m, we can write (12.25) as

    min_{β_0, θ} Σ_{i=1}^N [1 − y_i (β_0 + Σ_{m=1}^∞ θ_m φ_m(x_i))]_+ + (λ/2) Σ_{m=1}^∞ θ_m^2 / δ_m.    (12.27)

Now (12.27) is identical in form to (5.49) on page 169 in Section 5.8, and the theory of reproducing kernel Hilbert spaces described there guarantees a finite-dimensional solution of the form

    f(x) = β_0 + Σ_{i=1}^N α_i K(x, x_i).    (12.28)

In particular we see there an equivalent version of the optimization criterion (12.25) [Equation (5.67) in Section 5.8.2; see also Wahba et al. (2000)],

    min_{β_0, α} Σ_{i=1}^N (1 − y_i f(x_i))_+ + (λ/2) α^T K α,    (12.29)

where K is the N × N matrix of kernel evaluations for all pairs of training features (Exercise 12.2). These models are quite general, and include, for example, the entire family of smoothing splines, additive and interaction spline models discussed

¹ For logistic regression with separable data, β̂_λ diverges, but β̂_λ / ||β̂_λ|| converges to the optimal separating direction.
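The penalty α^T K α in (12.29) only makes sense because the Gram matrix of a valid kernel is symmetric positive semi-definite, as required of K in (12.21). A quick numerical sketch (an illustration, not from the text; the radial basis kernel and simulated points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))

def rbf(x, xp, gamma=1.0):
    # radial basis kernel from (12.22)
    return np.exp(-gamma * np.sum((x - xp) ** 2))

# the N x N matrix of kernel evaluations used in the penalty of (12.29)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# a valid kernel yields a symmetric positive semi-definite Gram matrix,
# so alpha^T K alpha >= 0 for every alpha
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```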
Mdule 3: Gaussian Prcess Parameter Estimatin, Predictin Uncertainty, and Diagnstics Jerme Sacks and William J Welch Natinal Institute f Statistical Sciences and University f British Clumbia Adapted frm
More informationk-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels
Mtivating Example Memry-Based Learning Instance-Based Learning K-earest eighbr Inductive Assumptin Similar inputs map t similar utputs If nt true => learning is impssible If true => learning reduces t
More informationLecture 8: Multiclass Classification (I)
Bayes Rule fr Multiclass Prblems Traditinal Methds fr Multiclass Prblems Linear Regressin Mdels Lecture 8: Multiclass Classificatin (I) Ha Helen Zhang Fall 07 Ha Helen Zhang Lecture 8: Multiclass Classificatin
More informationSequential Allocation with Minimal Switching
In Cmputing Science and Statistics 28 (1996), pp. 567 572 Sequential Allcatin with Minimal Switching Quentin F. Stut 1 Janis Hardwick 1 EECS Dept., University f Michigan Statistics Dept., Purdue University
More informationInternal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.
Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.
More informationA Matrix Representation of Panel Data
web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins
More information7 TH GRADE MATH STANDARDS
ALGEBRA STANDARDS Gal 1: Students will use the language f algebra t explre, describe, represent, and analyze number expressins and relatins 7 TH GRADE MATH STANDARDS 7.M.1.1: (Cmprehensin) Select, use,
More informationDepartment of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets
Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0
More informationInterference is when two (or more) sets of waves meet and combine to produce a new pattern.
Interference Interference is when tw (r mre) sets f waves meet and cmbine t prduce a new pattern. This pattern can vary depending n the riginal wave directin, wavelength, amplitude, etc. The tw mst extreme
More informationcfl Cpyright by Ji Zhu 2003 All Rights Reserved ii
FLEXIBLE STATISTICAL MODELING a dissertatin submitted t the department f statistics and the cmmittee n graduate studies f stanfrd university in partial fulfillment f the requirements fr the degree f dctr
More informationThe Entire Regularization Path for the Support Vector Machine
Jurnal f Machine Learning esearch 0 (200) 2 Submitted 3/0; Published?? The Entire egularizatin Path fr the Supprt Vectr Machine Trevr Hastie Department f Statistics Stanfrd University Stanfrd, CA 9305,
More informationBuilding to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems.
Building t Transfrmatins n Crdinate Axis Grade 5: Gemetry Graph pints n the crdinate plane t slve real-wrld and mathematical prblems. 5.G.1. Use a pair f perpendicular number lines, called axes, t define
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationCS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007
CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is
More informationDistributions, spatial statistics and a Bayesian perspective
Distributins, spatial statistics and a Bayesian perspective Dug Nychka Natinal Center fr Atmspheric Research Distributins and densities Cnditinal distributins and Bayes Thm Bivariate nrmal Spatial statistics
More informationLyapunov Stability Stability of Equilibrium Points
Lyapunv Stability Stability f Equilibrium Pints 1. Stability f Equilibrium Pints - Definitins In this sectin we cnsider n-th rder nnlinear time varying cntinuus time (C) systems f the frm x = f ( t, x),
More informationCHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.
MATH 1342 Ch. 24 April 25 and 27, 2013 Page 1 f 5 CHAPTER 24: INFERENCE IN REGRESSION Chapters 4 and 5: Relatinships between tw quantitative variables. Be able t Make a graph (scatterplt) Summarize the
More informationFigure 1a. A planar mechanism.
ME 5 - Machine Design I Fall Semester 0 Name f Student Lab Sectin Number EXAM. OPEN BOOK AND CLOSED NOTES. Mnday, September rd, 0 Write n ne side nly f the paper prvided fr yur slutins. Where necessary,
More informationHomology groups of disks with holes
Hmlgy grups f disks with hles THEOREM. Let p 1,, p k } be a sequence f distinct pints in the interir unit disk D n where n 2, and suppse that fr all j the sets E j Int D n are clsed, pairwise disjint subdisks.
More informationSUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis
SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm
More informationPreparation work for A2 Mathematics [2017]
Preparatin wrk fr A2 Mathematics [2017] The wrk studied in Y12 after the return frm study leave is frm the Cre 3 mdule f the A2 Mathematics curse. This wrk will nly be reviewed during Year 13, it will
More information22.54 Neutron Interactions and Applications (Spring 2004) Chapter 11 (3/11/04) Neutron Diffusion
.54 Neutrn Interactins and Applicatins (Spring 004) Chapter (3//04) Neutrn Diffusin References -- J. R. Lamarsh, Intrductin t Nuclear Reactr Thery (Addisn-Wesley, Reading, 966) T study neutrn diffusin
More informationMargin Distribution and Learning Algorithms
ICML 03 Margin Distributin and Learning Algrithms Ashutsh Garg IBM Almaden Research Center, San Jse, CA 9513 USA Dan Rth Department f Cmputer Science, University f Illinis, Urbana, IL 61801 USA ASHUTOSH@US.IBM.COM
More informationEDA Engineering Design & Analysis Ltd
EDA Engineering Design & Analysis Ltd THE FINITE ELEMENT METHOD A shrt tutrial giving an verview f the histry, thery and applicatin f the finite element methd. Intrductin Value f FEM Applicatins Elements
More informationthe results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must
M.E. Aggune, M.J. Dambrg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, "Dynamic and static security assessment f pwer systems using artificial neural netwrks", Prceedings f the NSF Wrkshp n Applicatins
More informationComputational modeling techniques
Cmputatinal mdeling techniques Lecture 11: Mdeling with systems f ODEs In Petre Department f IT, Ab Akademi http://www.users.ab.fi/ipetre/cmpmd/ Mdeling with differential equatins Mdeling strategy Fcus
More informationENSC Discrete Time Systems. Project Outline. Semester
ENSC 49 - iscrete Time Systems Prject Outline Semester 006-1. Objectives The gal f the prject is t design a channel fading simulatr. Upn successful cmpletin f the prject, yu will reinfrce yur understanding
More informationChapter 4. Unsteady State Conduction
Chapter 4 Unsteady State Cnductin Chapter 5 Steady State Cnductin Chee 318 1 4-1 Intrductin ransient Cnductin Many heat transfer prblems are time dependent Changes in perating cnditins in a system cause
More informationModeling the Nonlinear Rheological Behavior of Materials with a Hyper-Exponential Type Function
www.ccsenet.rg/mer Mechanical Engineering Research Vl. 1, N. 1; December 011 Mdeling the Nnlinear Rhelgical Behavir f Materials with a Hyper-Expnential Type Functin Marc Delphin Mnsia Département de Physique,
More informationChapter 15 & 16: Random Forests & Ensemble Learning
Chapter 15 & 16: Randm Frests & Ensemble Learning DD3364 Nvember 27, 2012 Ty Prblem fr Bsted Tree Bsted Tree Example Estimate this functin with a sum f trees with 9-terminal ndes by minimizing the sum
More informationOn Huntsberger Type Shrinkage Estimator for the Mean of Normal Distribution ABSTRACT INTRODUCTION
Malaysian Jurnal f Mathematical Sciences 4(): 7-4 () On Huntsberger Type Shrinkage Estimatr fr the Mean f Nrmal Distributin Department f Mathematical and Physical Sciences, University f Nizwa, Sultanate
More informationKinetic Model Completeness
5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins
More informationLecture 17: Free Energy of Multi-phase Solutions at Equilibrium
Lecture 17: 11.07.05 Free Energy f Multi-phase Slutins at Equilibrium Tday: LAST TIME...2 FREE ENERGY DIAGRAMS OF MULTI-PHASE SOLUTIONS 1...3 The cmmn tangent cnstructin and the lever rule...3 Practical
More informationFloating Point Method for Solving Transportation. Problems with Additional Constraints
Internatinal Mathematical Frum, Vl. 6, 20, n. 40, 983-992 Flating Pint Methd fr Slving Transprtatin Prblems with Additinal Cnstraints P. Pandian and D. Anuradha Department f Mathematics, Schl f Advanced
More informationDifferentiation Applications 1: Related Rates
Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm
More informationAdmissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs
Admissibility Cnditins and Asympttic Behavir f Strngly Regular Graphs VASCO MOÇO MANO Department f Mathematics University f Prt Oprt PORTUGAL vascmcman@gmailcm LUÍS ANTÓNIO DE ALMEIDA VIEIRA Department
More informationEric Klein and Ning Sa
Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure
More informationMaterials Engineering 272-C Fall 2001, Lecture 7 & 8 Fundamentals of Diffusion
Materials Engineering 272-C Fall 2001, Lecture 7 & 8 Fundamentals f Diffusin Diffusin: Transprt in a slid, liquid, r gas driven by a cncentratin gradient (r, in the case f mass transprt, a chemical ptential
More informationinitially lcated away frm the data set never win the cmpetitin, resulting in a nnptimal nal cdebk, [2] [3] [4] and [5]. Khnen's Self Organizing Featur
Cdewrd Distributin fr Frequency Sensitive Cmpetitive Learning with One Dimensinal Input Data Aristides S. Galanpuls and Stanley C. Ahalt Department f Electrical Engineering The Ohi State University Abstract
More informationand the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are:
Algrithm fr Estimating R and R - (David Sandwell, SIO, August 4, 2006) Azimith cmpressin invlves the alignment f successive eches t be fcused n a pint target Let s be the slw time alng the satellite track
More informationStatistical Learning. 2.1 What Is Statistical Learning?
2 Statistical Learning 2.1 What Is Statistical Learning? In rder t mtivate ur study f statistical learning, we begin with a simple example. Suppse that we are statistical cnsultants hired by a client t
More informationThermodynamics and Equilibrium
Thermdynamics and Equilibrium Thermdynamics Thermdynamics is the study f the relatinship between heat and ther frms f energy in a chemical r physical prcess. We intrduced the thermdynamic prperty f enthalpy,
More informationLinear Methods for Regression
3 Linear Methds fr Regressin This is page 43 Printer: Opaque this 3.1 Intrductin A linear regressin mdel assumes that the regressin functin E(Y X) is linear in the inputs X 1,...,X p. Linear mdels were
More informationFall 2013 Physics 172 Recitation 3 Momentum and Springs
Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.
More informationABSORPTION OF GAMMA RAYS
6 Sep 11 Gamma.1 ABSORPTIO OF GAMMA RAYS Gamma rays is the name given t high energy electrmagnetic radiatin riginating frm nuclear energy level transitins. (Typical wavelength, frequency, and energy ranges
More informationA Comparison of Methods for Computing the Eigenvalues and Eigenvectors of a Real Symmetric Matrix. By Paul A. White and Robert R.
A Cmparisn f Methds fr Cmputing the Eigenvalues and Eigenvectrs f a Real Symmetric Matrix By Paul A. White and Rbert R. Brwn Part I. The Eigenvalues I. Purpse. T cmpare thse methds fr cmputing the eigenvalues
More informationLecture 3: Principal Components Analysis (PCA)
Lecture 3: Principal Cmpnents Analysis (PCA) Reading: Sectins 6.3.1, 10.1, 10.2, 10.4 STATS 202: Data mining and analysis Jnathan Taylr, 9/28 Slide credits: Sergi Bacallad 1 / 24 The bias variance decmpsitin
More information1 The limitations of Hartree Fock approximation
Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants
More informationGeneral Chemistry II, Unit I: Study Guide (part I)
1 General Chemistry II, Unit I: Study Guide (part I) CDS Chapter 14: Physical Prperties f Gases Observatin 1: Pressure- Vlume Measurements n Gases The spring f air is measured as pressure, defined as the
More information