Margin Distribution and Learning Algorithms

Ashutosh Garg (ASHUTOSH@US.IBM.COM), IBM Almaden Research Center, San Jose, CA, USA
Dan Roth (DANR@UIUC.EDU), Department of Computer Science, University of Illinois, Urbana, IL, USA

ICML 2003

Abstract

Recent theoretical results have shown that improved bounds on the generalization error of classifiers can be obtained by explicitly taking the observed margin distribution of the training data into account. Currently, algorithms used in practice do not make use of the margin distribution and are driven by optimization with respect to the points that are closest to the hyperplane. This paper enhances earlier theoretical results and derives a practical data-dependent complexity measure for learning. The new complexity measure is a function of the observed margin distribution of the data, and can be used, as we show, as a model selection criterion. We then present the Margin Distribution Optimization (MDO) learning algorithm, which directly optimizes this complexity measure. Empirical evaluation of MDO demonstrates that it consistently outperforms SVM.

1. Introduction

The study of the generalization abilities of learning algorithms and their dependence on sample complexity is one of the fundamental research efforts in learning theory. Understanding the inherent difficulty of learning problems allows one to evaluate the possibility of learning in certain situations, to estimate the degree of confidence in the predictions made, and is crucial in understanding, analyzing, and developing improved learning algorithms. Recent efforts in these directions (Garg et al., 2002; Langford & Shawe-Taylor, 2002) were devoted to developing generalization bounds for linear classifiers which make use of the actual observed margin distribution on the training data, rather than relying only on the distance of the points closest to the hyperplane (the margin of the classifier). Similar results were obtained earlier for a restricted case, when a convex combination of multiple classifiers is used as the classifier (Schapire et al., 1997). At the heart of these results are analysis techniques that can use an appropriately weighted combination of all the data points, weighted according to their distance from the hyperplane.

This paper shows that these theoretical results can be made practical, be used for model selection, and drive a new learning algorithm. Building on (Garg et al., 2002), which introduced the data dependent generalization bounds, we first develop a more realistic version of the bound that takes into account the classifier's bias term. An important outcome of (Garg et al., 2002), which we slightly modify here, is a data dependent complexity measure for learning which we call the projection profile of the data. The projection profile of data sampled according to a distribution D is the expected amount of error introduced when a classifier h is randomly projected, along with the data, into a k-dimensional space. Our analysis shows that it is captured by the following quantity:

    a_k(D, h) = \int_{x \in D} u_k(x)\, dD,  where

    u_k(x) = \min\left\{ 3\exp\left(-\frac{(\nu(x)+b)^2\, k}{8\,(2+\nu(x)^2)}\right),\ \frac{2}{\sqrt{k}\,(\nu(x)+b)},\ 1 \right\}   (1)

and \nu(x) is the distance between x and the classifying hyperplane defined by h, a linear classifier for D (our analysis does not assume linearly separable data), and b is the bias of the classifier. The sequence P(D, h) = (a_1(D, h), a_2(D, h), ...) is the projection profile of D. The projection profile turns out to be quite informative, both theoretically and in practice. (Garg et al., 2002) proved its relevance to generalization performance and used it to develop sample complexity bounds (bounds on generalization error) that are more informative than existing bounds for high dimensional learning problems.
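For illustration only (this is not the implementation used in our experiments; the function names and the use of NumPy are incidental), the per-point term u_k(x) of Eqn. 1 and its average over a sample, the empirical analogue of a_k(D, h), can be computed as follows:

import numpy as np

def u_k(nu, b, k):
    # Per-point term of Eqn. 1: probability of a projection-induced error for a point
    # at signed distance nu from the hyperplane, with bias b and target dimension k.
    t1 = 3.0 * np.exp(-((nu + b) ** 2) * k / (8.0 * (2.0 + nu ** 2)))
    t2 = 2.0 / (np.abs(nu + b) * np.sqrt(k))      # abs() guards the sign of the margin
    return np.minimum(np.minimum(t1, t2), 1.0)

def empirical_projection_profile(nu, b, ks):
    # Empirical analogue of the sequence (a_1(D,h), a_2(D,h), ...): average u_k over the sample.
    return np.array([np.mean(u_k(nu, b, k)) for k in ks])

rng = np.random.default_rng(0)
nu = rng.normal(0.3, 0.2, size=100)               # observed margins nu(x_i) of a hypothetical sample
print(empirical_projection_profile(nu, 0.0, ks=[10, 50, 100, 500]))   # shrinks as k grows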
The main contribution of this paper is to show that the projection profile of the observed data with respect to a learned classifier can be directly optimized, yielding a new learning algorithm for linear classifiers, MDO (Margin Distribution Optimization), that attempts to be optimal with respect to the margin distribution based complexity measure. Specifically, we first argue that this complexity measure can be used for model selection.

Empirically, we show that the margin distribution of the data with respect to a classifier behaves differently than the margin, and we observe that it is both better correlated with the classifier's accuracy and more stable when measurements over the training data are compared with expected values. With this as motivation we develop an approximation to Eqn. 1 that is used to drive an algorithm that learns a linear classifier which directly optimizes this measure. We use several data sets to compare the resulting classifiers to those achieved by optimizing the margin, as in SVM, and show that MDO yields better classifiers.

The paper is organized as follows. In the next section we introduce the notation and some definitions that will be used in later sections. Sec. 3 introduces our enhanced version of the data dependent generalization bound based on the margin distribution of the training data and discusses its implications. In Sec. 4 we show that the projection profile is not only meaningful for the purpose of generalization bounds but can also be used as a model selection criterion. The MDO algorithm is introduced in Sec. 5 and experimental results with it are presented in Sec. 6. We conclude by discussing some open issues.

2. Preliminaries

We study a binary classification problem f : IR^n -> {-1, 1}, a mapping from an n-dimensional space to class labels {-1, 1}. Let S = {(x_1, y_1), ..., (x_m, y_m)} denote a sample set of m examples. The hypothesis h in IR^n is an n-dimensional linear classifier and b is the bias term. That is, for an example x in IR^n, the hypothesis predicts ŷ(x) = sign(h^T x + b). We use n to denote the original (high) dimensionality of the data, k to denote the (smaller) dimension into which projections are made, and m to denote the sample size of the data. The subscript will refer to the index of the example in the sample set and the superscript will refer to the particular dimension under consideration.

Definition 2.1 Under the 0-1 loss function, the empirical error \hat{E} of h over a sample set S and the expected error E are given respectively by

    \hat{E}(h, S) = \frac{1}{m} \sum_i I(\hat{y}(x_i) \neq y_i); \qquad E(h, S) = E_x\left[ I(\hat{y}(x) \neq f(x)) \right],

where I(\cdot) is the indicator function, which is 1 when its argument is true and 0 otherwise. The expectation E_x is over the distribution of the data.

We denote by \|\cdot\| the L2 norm of a vector. We will assume w.l.o.g. that all data points come from the surface of the unit sphere (i.e., for all x, \|x\| = 1) and that \|h\| = 1 (the non-unit norm of classifier and data will be accounted for by normalizing the bias b). Let \nu(x) = h^T x denote the signed distance of the sample point x from the classifier h. When x_j refers to the j-th sample from S, we denote it by \nu_j = \nu(x_j) = h^T x_j. With this notation (omitting the classifier h) the classification rule reduces simply to sign(\nu(x) + b).

Our analysis of margin distribution based bounds is based on results in the area of random projection of high dimensional data, developed in (Johnson & Lindenstrauss, 1984) and further studied in several other works, e.g., (Arriaga & Vempala, 1999). Briefly, the method of random projection shows that with high probability, when n-dimensional data is projected down to a lower dimensional space of dimension k, using a random k x n matrix, relative distances between points are almost preserved. We will use this to show that if the data is separable with large margin in a high dimensional space, then it can be projected down to a low dimensional space without incurring much error. Next we give the definition of a random matrix:

Definition 2.2 (Random Matrix) A random projection matrix R is a k x n matrix, each of whose entries r_ij is drawn independently from N(0, 1/k).
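A minimal sketch of Definition 2.2 (illustrative Python, not part of the algorithm described later): draw a random k x n matrix with N(0, 1/k) entries, project both the data and the classifier, and measure how often the predicted sign changes, the kind of additional error analyzed in the next section.

import numpy as np

def random_projection(n, k, rng):
    # Random projection matrix R of Definition 2.2: k x n, entries drawn i.i.d. from N(0, 1/k).
    return rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))

rng = np.random.default_rng(1)
n, k, m = 1000, 50, 200

# Unit-norm data and a unit-norm classifier, as assumed in the preliminaries.
X = rng.normal(size=(m, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
h = rng.normal(size=n);      h /= np.linalg.norm(h)
b = 0.1

R = random_projection(n, k, rng)
Xp, hp = X @ R.T, R @ h                           # x' = R x and h' = R h

# Fraction of points whose predicted label changes under projection.
flips = np.mean(np.sign(X @ h + b) != np.sign(Xp @ hp + b))
print(f"sign disagreements after projection: {flips:.3f}")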
For x in IR^n, we denote by x' = Rx in IR^k the projection of x from an n- to a k-dimensional space using the projection matrix R. Similarly, for a classifier h in IR^n, h' denotes its projection to a k-dimensional space via R, S' denotes the set of points which are the projections of the sample S, and \nu'_j = (h')^T x'_j is the signed distance in the projected space.

3. A Margin Distribution based Bound

The decision of the classifier h is based on the sign of \nu(x) + b = h^T x + b. Since both h and x are normalized, \nu(x) can be thought of as the geometric distance between x and the hyperplane orthogonal to h that passes through the origin. Given a distribution on data points x, this induces a distribution on their distance from the hyperplane induced by h, which we refer to as the margin distribution. Note that this is different from the margin of the sample set S with respect to a classifier h, traditionally defined in the learning community as the distance of the point which is closest to the hyperplane: the margin of S w.r.t. h is \gamma(S, h) = \min_{i=1,...,m} |h^T x_i + b|.
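The distinction between the margin \gamma(S, h) and the full margin distribution is easy to see numerically; the following illustrative sketch (synthetic data, not one of the datasets used later) computes both for a random unit-norm sample and classifier:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20)); X /= np.linalg.norm(X, axis=1, keepdims=True)
h = rng.normal(size=20);        h /= np.linalg.norm(h)
b = 0.0
y = np.sign(X @ h + b)                  # labels consistent with h, for illustration only

nu = X @ h                              # signed distances nu(x_i) = h^T x_i
gamma = np.min(np.abs(nu + b))          # the classical margin gamma(S, h): one extreme point
dist_summary = np.percentile(y * (nu + b), [10, 50, 90])   # the margin distribution
print("margin gamma(S, h):", gamma)
print("10th/50th/90th percentiles of the margin distribution:", dist_summary)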

In (Garg et al., 2002) we developed a data dependent margin bound for the case when the classification is done as ŷ = sign(\nu(x)). Although theoretically one can absorb the bias by augmenting x with a constant coordinate and folding b into h, and obtain a similar result, it turns out that in practice this may not be a good idea. The bias term b is related to the distance of the hyperplane from the origin, and h is the slope of the hyperplane. Proving the bound, as well as developing the algorithmic approach (based on Eqn. 3 below), requires considering a version of the hyperplane projected into a lower dimensional space, k, in a way that does not significantly affect the classification performance. At times, however, the absolute value of b may be much larger than the individual components of h (even after normalization by the norm of h). In these cases, b will contribute more to the noise associated with the projection and it may not be a good idea to project b. Due to this observation, the algorithmic approach that is based on Eqn. 3 necessitates that we prove a slightly modified version of the result given in (Garg et al., 2002).

Theorem 3.1 Let S = {(x_1, y_1), ..., (x_m, y_m)} be a set of n-dimensional labeled examples and h a linear classifier with bias term b. Then, for all constants 0 < \delta < 1 and 0 < k, with probability at least 1 - 4\delta, the expected error of h is bounded by

    E(h) \le \hat{E}(S, h) + \min_k \left[ \mu_k + \sqrt{ \frac{(k+2)\ln\frac{me}{k+2} + \ln\frac{2}{\delta}}{m} } \right]   (2)

with \mu_k redefined as

    \mu_k = \frac{2}{m\delta} \sum_{j=1}^m \min\left\{ 3\exp\left(-\frac{(\nu_j+b)^2\, k}{8\,(2+\nu_j^2)}\right),\ \frac{2}{\sqrt{k}\,(\nu_j+b)},\ 1 \right\}.

Note that by choosing b = 0, the above result reduces to the one given in (Garg et al., 2002). Although one can prove the above theorem following the steps given in (Garg et al., 2002), we give a sketch of the proof as it is instrumental in developing an intuition for the algorithm. The bound given in Eqn. 2 has two main components. The first component, \mu_k, captures the distortion incurred due to the random projection to dimension k; the second term follows directly from VC theory for a classifier in a k-dimensional space. The random projection theorem (Arriaga & Vempala, 1999) states that relative distances are (almost) preserved when projecting to a lower dimensional space. Therefore, we first argue that, with high probability, the image under projection of data points that are far from h in the original space will still be far in the projected (k-dimensional) space. The first term quantifies the penalty incurred due to data points whose images will not be consistent with the image of h. That is, this term bounds the additional error incurred due to the projection to a k-dimensional space. Once the data lies in the lower dimensional space, we can bound the expected error of the classifier on the data as a function of the dimension of the space, the number of samples, and the empirical error there (which is accounted for by the first component). Decreasing the dimension of the projected space increases the contribution of the first term, while the VC-dimension based term decreases. To get the optimal bound, one has to balance these two quantities and choose the dimension k of the projected space so that the generalization error is minimized. The following lemma is the key in proving the above theorem; it quantifies the penalty incurred when projecting the data down to a k-dimensional space. For the proof, please refer to (Garg et al., 2002).

Lemma 3.2 Let h be an n-dimensional classifier and x in IR^n a sample point, such that \|h\| = \|x\| = 1, and let \nu = h^T x. Let R in IR^{k x n} be a random projection matrix (Def. 2.2), with h' = Rh, x' = Rx, and let the classification be done as sign(h^T x + b). Then the probability of misclassifying x, relative to its classification in the original space, due to the random projection, is

    P\left[ \mathrm{sign}((h')^T x' + b) \neq \mathrm{sign}(h^T x + b) \right] \le \min\left\{ 3\exp\left(-\frac{(\nu+b)^2\, k}{8\,(2+\nu^2)}\right),\ \frac{2}{\sqrt{k}\,(\nu+b)},\ 1 \right\}.

The above lemma establishes a bound on the additional classification error that is incurred when projecting the sample down from an n- to a k-dimensional space.
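Given the empirical margins, the bound of Eqn. 2 can be evaluated and minimized over k directly. The sketch below is illustrative only and follows the form of Eqn. 2 as reconstructed above, including its constant factors:

import numpy as np

def mu_k(nu, b, k, m, delta):
    # Projection-distortion term mu_k of Theorem 3.1.
    t1 = 3.0 * np.exp(-((nu + b) ** 2) * k / (8.0 * (2.0 + nu ** 2)))
    t2 = 2.0 / (np.sqrt(k) * np.abs(nu + b))
    return (2.0 / (m * delta)) * np.sum(np.minimum(np.minimum(t1, t2), 1.0))

def margin_distribution_bound(nu, b, emp_err, m, delta, ks):
    # Empirical error plus the best trade-off over k between projection distortion
    # and the VC-style term for a linear classifier in k dimensions.
    vals = [mu_k(nu, b, k, m, delta)
            + np.sqrt(((k + 2) * np.log(m * np.e / (k + 2)) + np.log(2.0 / delta)) / m)
            for k in ks]
    best = int(np.argmin(vals))
    return emp_err + vals[best], ks[best]

rng = np.random.default_rng(3)
nu = rng.normal(0.4, 0.2, size=500)                # margins of a hypothetical training set
print(margin_distribution_bound(nu, b=0.0, emp_err=0.05, m=500, delta=0.05, ks=range(5, 400, 5)))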
Note that in this formulation, unlike the one in (Garg et al., 2002), we only project h while keeping b as in the original space. It can be shown that in many cases this bound is tighter than the one given in (Garg et al., 2002). Once the above result is established, one can prove the theorem by making use of the symmetrization lemma (Anthony & Bartlett, 1999) and standard VC-dimension based arguments (Vapnik, 1998). See (Garg et al., 2002) for details.

Existing Learning Algorithms

In this section, we attempt to provide some intuition for the significance of considering the margin distribution when selecting a classifier. We note that the discussion is general although it is done in the context of linear classifiers. It has been shown that any classifier can be mapped to a linear classifier in a high dimensional space. This suggests that the analysis of linear classifiers is general enough and can give meaningful insights. The selection of a classifier from among a set of classifiers that behave well on the training data (with respect to some complexity measure) is based on Vapnik's (Vapnik, 1998) structural risk minimization principle. The key question then is: what is the measure that should guide this selection? A cartoon example is depicted in Fig. 1, showing training data in a two dimensional space, with positive and negative examples marked by different symbols.

Figure 1. (a) A cartoon showing the training data in a two dimensional space, with one symbol corresponding to the positive class and another to the negative class. (b) Linear classifiers that can classify the data without error. (c) The hyperplane learned by a large margin classifier such as SVM. (d) This hyperplane may be a better choice than the one learned by SVM, since more data points are farther from the hyperplane.

Clearly, there are a number of linear classifiers that can classify the training data perfectly. Fig. 1(b) shows some of these classifiers. All these classifiers have zero error on the training data, and as such any selection criterion based on training error only will not distinguish between them. One of the popular measures that guides the selection of a classifier is the margin of a hyperplane with respect to its training data. Clearly, however, this measure is sensitive since it typically depends on a small number of extreme points. This can be significant not only in the case of noisy data, but also in the case of linearly separable data, since this selection may affect performance on future data. In the context of large margin classifiers such as SVM, researchers have tried to get around this problem by introducing slack variables and ignoring some of the closest points in an ad hoc fashion, but this extreme notion of margin still drives the optimization. The cartoon in Fig. 1(d) shows a hyperplane that was chosen by taking into account all the data points rather than only those that are closest to the hyperplane. Intuitively, this looks like a better choice than the one in Fig. 1(c) (which would be chosen by a large margin classifier), by virtue of relying on a more stable measure. In the next section, we argue that one can use the projection profile as a model selection criterion and show that this hyperplane is the one that would actually be selected if this model selection criterion were used.

4. Algorithmic Implications

The expected probability of error for a k-dimensional image of a point x that is at distance \nu(x) from an n-dimensional hyperplane (where the expectation is with respect to selecting a random projection matrix) is given by the projection profile in Eqn. 1. This expression measures the contribution of a point to the generalization error as a function of its distance from the hyperplane (for a fixed k).

Table 1. Correlations for four datasets from the UCI ML Repository (Pima, Breast Cancer, Ionosphere, Credit). The first column (C:Margin-E) gives the correlation coefficient between the margin of the training data and the error on the test set. The second column (C:ProjProfile-E) gives the correlation coefficient between the weighted margin according to the cost function given in Eqn. 5 and the error on the test set. When computing the margin we look at only correctly classified points.

Figure 2(a) shows this term (Eqn. 1) for different values of k. All points contribute to the error, and the relative contribution of a point decays as a function of its distance from the hyperplane. Hypothetically, given a probability distribution over the instance space and a fixed classifier, one can compute the margin distribution, which can then be used to compute the projection profile of the data. The bound given in Thm. 3.1 depends only on the projection profile of the data and the projection dimension and is independent of the particular classifier. Of all the classifiers with the same training error, the one with the smallest projection profile value a_k will have the best generalization performance. Therefore, the theorem suggests a model selection criterion: of all the classifiers, choose the one that optimizes the projection profile and not merely the margin.
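Operationally, this criterion amounts to the following sketch (illustrative code; projection_profile_score and select_by_profile are names introduced here, not part of MDO): among candidate classifiers with equal training error, rank them by their empirical projection profile rather than by their minimum margin.

import numpy as np

def projection_profile_score(X, h, b, k=100):
    # Empirical a_k for the classifier (h, b): mean per-point projection error of Eqn. 1.
    nu = X @ h
    t1 = 3.0 * np.exp(-((nu + b) ** 2) * k / (8.0 * (2.0 + nu ** 2)))
    t2 = 2.0 / (np.abs(nu + b) * np.sqrt(k) + 1e-12)   # small epsilon avoids division by zero
    return np.mean(np.minimum(np.minimum(t1, t2), 1.0))

def select_by_profile(classifiers, X, k=100):
    # Among candidate (h, b) pairs with equal training error, keep the smallest profile.
    return min(classifiers, key=lambda hb: projection_profile_score(X, hb[0], hb[1], k))

On the cartoon of Fig. 1, this ranking favors the hyperplane of panel (d) over the maximum-margin hyperplane of panel (c), which is exactly the selection argued for above.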
A similar argument, but for a slightly restricted problem, was presented by (Mason et al., 2000). They argued that when learning a convex combination of classifiers, instead of maximizing the margin one should maximize a weighted function of the distance of the points from the learned classifier.

Figure 2. (a) The contribution of data points to the generalization error as a function of their distance from the hyperplane, shown for several values of k (the contribution decreases as k increases). (b) The contribution of data points to the generalization error as a function of their distance from the hyperplane, as given by (Mason et al., 2000). (c) The weight given to the data points by the MDO algorithm as a function of their margin.

This was used to develop a margin distribution based generalization bound for convex combinations of classifiers. The weighting function that was proposed is shown in Fig. 2(b). The results of their algorithm showed that a weighted margin distribution (for a convex combination of classifiers) is a better complexity measure than the extreme notion of margin which has typically been used in large margin classification algorithms like SVM. Although the problem addressed there is different from ours, interestingly, the form of their weighting function (shown in Fig. 2(b)) is very similar to ours. In our work, instead of working with the actual projection profile, we work with a simpler form of it, given in Eqn. 3:

    u(x) = \exp\left(-(\nu(x)+b)^2\, k\right)   (3)

The advantage of using the projection profile as a model selection criterion is evident from the experiments presented in Table 1. In particular, we took four data sets from the UCI ML repository. For each data set, we randomly divided it into a test and a training set. A classifier was learned on the training set and was evaluated on the test data. This experiment was repeated 200 times (each time choosing a different subset as training data). Each time we computed the actual margin of the SVM that was used as the initial guess and the weighted margin of this classifier according to the cost function given in Eqn. 5. Table 1 gives the correlation between the generalization error of the classifier (its performance on the test data) and these quantities. The first column in Table 1 gives the correlation between the error on the test data and the margin on the training data. The second column gives the correlation between the error on the test data and the weighted margin (projection profile) computed on the training data. Note that here we are looking at the absolute value of the correlation coefficient. The correlation results highlight the fact that our new cost function is a better predictor of generalization error. Further evidence is presented in Table 2, which gives empirical evidence that the cost function given in Eqn. 5 is a more stable measure than the margin. The first column in Table 2 gives the correlation between the margin (computed as the distance of the closest correctly classified point from the hyperplane) on the training and test data over 200 runs. The second column gives the correlation between the weighted margin (projection profile) computed on the training data and on the test data. Again, a high correlation in the latter case indicates the stability of this new measure.
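The protocol behind Tables 1 and 2 can be outlined as follows. The sketch is illustrative only: it uses synthetic data instead of the UCI sets, scikit-learn's LinearSVC in place of the SVM implementation used in our experiments, and the simplified weight of Eqn. 3 in place of the full cost of Eqn. 5, so the numbers it prints are not those of the tables.

import numpy as np
from sklearn.svm import LinearSVC

def weighted_margin(f, k=100):
    # Simplified projection profile of Eqn. 3, summed over correctly classified points.
    return np.sum(np.exp(-(f ** 2) * k))

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 20)); w_true = rng.normal(size=20)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=600))

margins_tr, profiles_tr, test_errs = [], [], []
for _ in range(50):                                    # the experiments repeat the split 200 times
    idx = rng.permutation(len(y)); tr, te = idx[:360], idx[360:]
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X[tr], y[tr])
    h = clf.coef_.ravel(); b = float(clf.intercept_[0])
    scale = np.linalg.norm(h); h, b = h / scale, b / scale
    f = y[tr] * (X[tr] @ h + b)                        # signed margins on the training split
    correct = f > 0
    margins_tr.append(f[correct].min())                # classical margin (correct points only)
    profiles_tr.append(weighted_margin(f[correct]))
    test_errs.append(np.mean(clf.predict(X[te]) != y[te]))

print("corr(margin, test error):  ", abs(np.corrcoef(margins_tr, test_errs)[0, 1]))
print("corr(profile, test error): ", abs(np.corrcoef(profiles_tr, test_errs)[0, 1]))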
5. The Margin Distribution Optimization (MDO) Algorithm

In this section, we introduce a new learning algorithm, the Margin Distribution Optimization (MDO) algorithm. MDO is driven by optimizing with respect to the simplified form of the projection profile given in Eqn. 3. The learning algorithm attempts to find a classifier that minimizes a cost function which happens to be the projection profile of the data. The cost function in Eqn. 3 determines the contribution we want our selected classifier to give to the data points. In order to choose a classifier that optimizes with respect to this measure, though, we need to change it in order to take into account misclassified points. We do that by defining a new cost function which explicitly gives higher weights to misclassified points. Thus, the weight given to an example x_i with true class label y_i under the learned hyperplane h with bias b is given by

    W(x_i) = \begin{cases} e^{-\alpha(\nu(x_i)+b)^2} & \text{if } y_i(\nu(x_i)+b) \ge 0, \\ e^{-\beta y_i(\nu(x_i)+b)} & \text{if } y_i(\nu(x_i)+b) < 0. \end{cases}   (4)

That is, for a correct classification, the weight is the one given by the projection profile; for misclassified points, however, the example is exponentially weighted.
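Eqn. 4 translates directly into code; the following illustrative sketch (with arbitrary values of alpha and beta) also reproduces the qualitative shape of the weight curves in Fig. 2(c):

import numpy as np

def mdo_weight(nu_plus_b, y, alpha, beta):
    # Per-example weight of Eqn. 4: projection-profile weight for correctly classified
    # points, exponentially growing weight for misclassified ones.
    signed = y * nu_plus_b
    return np.where(signed >= 0,
                    np.exp(-alpha * nu_plus_b ** 2),
                    np.exp(-beta * signed))

# A slice of the family of weight curves in Fig. 2(c): weight versus margin y*(nu(x)+b).
margin = np.linspace(-1.0, 1.0, 9)
print(mdo_weight(margin, np.ones_like(margin), alpha=4.0, beta=4.0))
# Misclassified points (negative margin) get the largest weights; points classified
# with a large margin get weights close to zero.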

Table 2. Correlations for four datasets from the UCI ML Repository (Pima, Breast Cancer, Ionosphere, Credit). The first column (C:Margin Tr-Te) gives the correlation coefficient between the margin on the training data and the margin on the test data. The second column (C:ProjProfile Tr-Te) gives the correlation coefficient between the weighted margin according to the cost function given in Eqn. 5 on the training data and on the test data. When computing the margin we look at only correctly classified points.

The constants \alpha, \beta allow one to control the weight of these two terms. \alpha should be thought of as the optimal projection dimension and could be optimized (although at this point our algorithm determines it heuristically). Fig. 2(c) shows a family of weight functions as a function of the margin \nu(x) + b of an example. In this figure, a positive margin means correct classification and a negative margin corresponds to misclassification. The weight given to points which are classified with a good margin is very low, while points which are either misclassified or very close to the margin get higher weights. Minimizing this cost function will essentially try to maximize the weighted margin while reducing the misclassification error. Note the close resemblance to the weighting function proposed by Mason et al. for the convex combination case, shown in Fig. 2(b). Given a training set S of size m, the optimization problem can be formulated as: find (h, b) such that the following cost function is minimized.

    L(h, b) = \sum_{i=1}^m I(\hat{y}_i = y_i)\, e^{-\alpha(\nu(x_i)+b)^2} + \sum_{i=1}^m I(\hat{y}_i \neq y_i)\, e^{-\beta y_i(\nu(x_i)+b)}   (5)

Here I(\cdot) is an indicator function which is 1 when its argument is true and 0 otherwise. \alpha is directly proportional to the notion of projection dimension and \beta is related to the tradeoff between the misclassification error and the projection profile. There are a few observations that need to be made before we proceed further. In most practical cases, learning is done with non-separable training data. Standard learning algorithms like SVM handle this case by introducing slack variables, and the optimal values of the slack variables are chosen by a search procedure. This makes the performance of the algorithm dependent on the particular search method. Our formulation automatically takes care of misclassified points. While minimizing the above cost function, it gives a larger penalty to points which are in error than to the ones which are correctly classified. The amount of error that will be tolerated can be controlled by choosing \alpha, \beta appropriately. One possible way to choose these parameters is by evaluating the learned classifier over some hold-out set. In our case, we observed that \alpha = 1/(\bar{\nu}+b)^2 and \beta = 1/(\bar{\nu}+b) gave good results in most cases, where \bar{\nu} = \sum_i \nu_i / m is an estimate of the average margin for a data set of size m and some h.

The main computational difficulty in our algorithm lies in the non-convexity of the objective function. In general, one would like to avoid solving a non-convex problem, as it is hard to obtain a globally optimal solution and there is no closed form expression. Nevertheless, we show that gradient descent methods are quite useful here. Due to the non-convexity, there are a number of local optima and saddle points, and one may not reach a globally optimal point. This makes the choice of the starting point extremely critical. In our work, we suggest using the SVM learning algorithm to choose the initial starting classifier. Once an initial classifier is obtained, the algorithm uses gradient descent over the cost function given in Eqn. 5. The algorithm stops once it has reached a local minimum of the cost function.
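A minimal rendering of the cost of Eqn. 5 and of the heuristic choice of alpha and beta described above (illustrative only; the heuristic constants follow the reconstruction given in the text):

import numpy as np

def mdo_cost(h, b, X, y, alpha, beta):
    # Cost of Eqn. 5 for the classifier (h, b) on the sample (X, y).
    f = y * (X @ h + b)            # f_i = y_i (h^T x_i + b); f_i^2 = (nu(x_i)+b)^2 when correct
    return (np.sum(np.exp(-alpha * f[f >= 0] ** 2))
            + np.sum(np.exp(-beta * f[f < 0])))

def heuristic_alpha_beta(h, b, X):
    # alpha = 1/(nu_bar + b)^2 and beta = 1/(nu_bar + b), with nu_bar the average margin.
    nu_bar = np.mean(X @ h)
    return 1.0 / (nu_bar + b) ** 2, 1.0 / (nu_bar + b)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10)); X /= np.linalg.norm(X, axis=1, keepdims=True)
h = rng.normal(size=10);        h /= np.linalg.norm(h)
b = 0.0
y = np.sign(X @ h + b)
alpha, beta = heuristic_alpha_beta(h, b, X)
print(mdo_cost(h, b, X, y, alpha, beta))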
Two important points that deserve attention are (1) the norm of the classifier h and (2) the non-differentiability of the cost function. In general, by increasing the norm of the classifier, we can decrease the value of the cost function (as the margin will increase). Therefore, we do not take the norm of the classifier h into account in the cost function, and at each iteration, after obtaining the updated vector h, we re-normalize it (if \|h\| \neq 1, define h \leftarrow h/\|h\| and b \leftarrow b/\|h\|; the classification result is unchanged). The second issue is more tricky. The derivative of the cost function is not defined for those x_i which have 0 margin, i.e., \nu(x_i) + b = 0. At such points, we compute both the left and the right derivatives. All the gradient directions are then evaluated and the one which leads to the maximum decrease in the cost function is selected. The algorithm terminates if no such gradient direction is found. The algorithm given in Fig. 5 gives the implementation details. It starts by learning a classifier using the SVM learning algorithm, obtaining (h, b). We assume that \|h\| = 1. \alpha, \beta are then initialized (as above). MAX CONST is the maximum number of points with zero margin that will be considered at any given iteration of the algorithm. Since at points with zero margin we evaluate both the left and the right derivative, the number of directions considered is exponential in the number of points with zero margin, and as such we limit the number of such points (for which both gradient directions are considered) to MAX CONST. MDO is the main procedure. If there are no points with 0 margin, the algorithm simply computes the gradients as

    \frac{\partial L(h, b)}{\partial h^k} = -2\alpha \sum_{i=1}^m I(\hat{y}_i = y_i)\, (\nu(x_i)+b)\, x_i^k\, e^{-\alpha(\nu(x_i)+b)^2} \;-\; \beta \sum_{i=1}^m I(\hat{y}_i \neq y_i)\, y_i\, x_i^k\, e^{-\beta y_i(\nu(x_i)+b)}   (6)

These derivatives are then used by the procedure Update to obtain the updated values of (h, b), given by (h', b'). While updating, we make sure that the cost function decreases at each iteration. However, if there are points with 0 margin, we randomly select MAX CONST such points, and at those we compute all the possible gradients (computing both left and right derivatives). The other points with zero margin are ignored. Once we have the derivative of the cost function, the same update procedure as above is called. While doing the experiments, we observed that the algorithm typically converges in a small number of steps.
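Eqn. 6 and the renormalizing update translate almost directly into code. The sketch below is illustrative: it omits the exhaustive left/right-derivative search over zero-margin points and the line search over the step size used in Fig. 5, and simply takes a fixed-size descent step.

import numpy as np

def mdo_gradient(h, b, X, y, alpha, beta):
    # Gradient of the cost of Eqn. 5 with respect to (h, b), following Eqn. 6;
    # points with exactly zero margin (where left/right derivatives differ) are ignored here.
    nu_b = X @ h + b
    f = y * nu_b
    c, w = f > 0, f < 0
    gc = -2.0 * alpha * nu_b[c] * np.exp(-alpha * nu_b[c] ** 2)   # correctly classified points
    gw = -beta * y[w] * np.exp(-beta * f[w])                      # misclassified points
    grad_h = X[c].T @ gc + X[w].T @ gw
    grad_b = np.sum(gc) + np.sum(gw)
    return grad_h, grad_b

def mdo_step(h, b, X, y, alpha, beta, eps=0.1):
    # One descent step followed by the renormalization h <- h/||h||, b <- b/||h||.
    gh, gb = mdo_gradient(h, b, X, y, alpha, beta)
    h_new, b_new = h - eps * gh, b - eps * gb
    scale = np.linalg.norm(h_new)
    return h_new / scale, b_new / scale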

6. Results

In this section, we provide an empirical comparison of MDO with SVM. Note that although, in our experiments, MDO's starting point is the SVM classifier, the results obtained by MDO could be worse than SVM's. Indeed, the selection of the new classifier is driven by our new optimization function, and it could lead to a larger error. Interestingly, when doing the experiments we observed that in many cases we obtained better generalization at the cost of slightly poorer performance on the training data. To understand this, we analyzed in detail the results for the breast cancer dataset from the UCI ML repository. In Fig. 3 we plot the margin and the value of the cost function given in Eqn. 5 over the gradient descent iterations. As expected, while the cost function of MDO improved (decreased in value), the margin deteriorated (since we started with SVM, the margin at the 0th iteration is maximal). Fig. 4 is even more revealing. It shows how the training and test errors changed with the iterations of the gradient descent. Interestingly, the error on the training data went up from 5% at the 0th iteration to 9%. The algorithm traded the increase in training error for the decrease in projection profile. The second plot gives the test error and, as is evident, the test error went down with more iterations. The gap between the training error and the test error also decreased, which shows that the new classifier generalized well and that the cost function proposed in Eqn. 5 is a better measure of learning complexity.

Figure 3. Experimental results for the breast cancer dataset from the UCI ML repository: the projection profile and the margin of the learned classifier on the training data as the algorithm iterates.

Figure 4. Experimental results for the breast cancer dataset from the UCI ML repository: the training error and test error at each iteration. With the iterations, the training error increased from the starting point (0th iteration) while the test error went down.

Table 3. Results comparing the performance of SVM and MDO. The first six datasets (Credit, Breast, Ionosphere, Heart, Liver, Mushroom) are from the UCI ML repository. The last two datasets (peace & piece, being & begin) are from the context sensitive spelling correction problem. The column SVM gives the % correct classification on test data when a classifier learned using SVM was used, and the column MDO gives the same for MDO. The last column gives the percentage reduction in error from SVM to MDO. As seen, in all cases MDO's results are comparable to or better than the ones obtained using SVM. These results are averaged over 100 runs.

In Table 3, we compare the classification performance of MDO and SVM on a number of datasets from the UCI ML repository and on two real world problems related to context sensitive spelling correction. The spelling data is taken from the pruned data set of (Golding & Roth, 1999). Two particular examples of context sensitive spelling correction, being & begin and peace & piece, are considered. For details on the UCI datasets see (Murphy, 1994). In all experiments, the data was randomly divided into a training set (60%) and a test set (40%). The experiments were repeated 100 times, each time choosing a different (randomly chosen) subset of training and test data. It is evident that except for a few cases (in which the improvement is not statistically significant), in most cases MDO outperforms SVM.

7. Conclusions

We presented a new, practical, margin distribution based complexity measure for learning classifiers, derived by enhancing a recently developed method for deriving margin distribution based generalization bounds. We have shown that this theoretically justified complexity measure can be used robustly as a model selection criterion and, most importantly, used it to drive a new learning algorithm for linear classifiers, one that is selected by optimizing with respect to this complexity measure. The results are based on a novel use of the margin distribution of the data relative to the learned classifier, which is different from the typical use of the notion of margin in machine learning. Consequently, the resulting algorithm is not sensitive to a small number of samples in determining the optimal hyperplane. Although the bound presented here is tighter than existing bounds and is sometimes informative for real problems, it is still loose and more research is needed to match observed performance on real data. Algorithmically, although we have given an implementation of the new algorithm MDO, one of the main directions of future research is to study it further, as well as to investigate other algorithmic implications of the ideas presented here.

Acknowledgments: This research is supported by NSF ITR-IIS and IIS grants and an IBM fellowship to Ashutosh Garg.

    Here we initialize the constants and obtain an initial estimate of the classifier
    Global:
        Training set: S = {(x_1, y_1), ..., (x_m, y_m)}
        Classifier: learn (h, b) using a linear SVM, where h, b: y_i(h^T x_i + b) > 0
            for all the correctly classified training examples
        Constants: MAX CONST, alpha, beta, c_inf

    This is the main procedure
    Procedure MDO(h, b, S)
        For all (x_i, y_i) in S:  f_h(x_i) = y_i(h^T x_i + b)
        C(h, b) = sum_{i: f_h(x_i) >= 0} e^{-alpha f_h(x_i)^2} + sum_{i: f_h(x_i) < 0} e^{-beta f_h(x_i)}
        A := {(x_i, y_i) in S : f_h(x_i) = 0}          (points with 0 margin)
        if (A = empty)                                  (no points with 0 margin)
            g_h := grad_h C(h, b);  g_b := grad_b C(h, b)
            Update(h, b, h', b', g_h, g_b)
        else                                            (A is the set of points with 0 margin)
            if |A| > MAX CONST then B is a random subset of A of size MAX CONST, else B = A
            N = |B|
            for each (b_1, ..., b_N) in {+1, -1}^N
                {g_h^{(b_1,...,b_N)}, g_b^{(b_1,...,b_N)}} = {grad_h^{(b_1,...,b_N)}, grad_b^{(b_1,...,b_N)}} C(h, b),
                    where grad^{(b_1,...,b_N)} takes the left or right derivative at the i-th
                    example in B according as b_i = +1 or -1
                Update(h, b, h', b', g_h, g_b)
                c^{(b_1,...,b_N)} = C(h, b) - C(h', b')
            end
            choose the h', b' which gave the maximum reduction in cost
        end
        if C(h, b) - C(h', b') < c_inf then BREAK else h = h'; b = b'; MDO(h, b, S)
    End

    This procedure gives the updated parameters based on the old parameters and the gradients
    Procedure Update(h, b, h', b', g_h, g_b)
        h' = h + eps * g_h,  b' = b + eps * g_b;  h' = h'/||h'||,  b' = b'/||h'||,
            where eps is the maximum positive real number such that C(h, b) - C(h', b') >= 0
    end

Figure 5. Sketch of the Margin Distribution Optimization (MDO) Algorithm.

References

Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.

Arriaga, R. I., & Vempala, S. (1999). An algorithmic theory of learning: Robust concepts and random projection. Proc. of the 40th Annual Symposium on Foundations of Computer Science.

Garg, A., Har-Peled, S., & Roth, D. (2002). On generalization bounds, projection profile, and margin distribution. Proc. of the International Conference on Machine Learning.

Golding, A. R., & Roth, D. (1999). A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34.

Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability.

Langford, J., & Shawe-Taylor, J. (2002). PAC-Bayes and margins. Advances in Neural Information Processing Systems.

Mason, L., Bartlett, P., & Baxter, J. (2000). Improved generalization through explicit optimization of margins. Machine Learning, 38(3).

Murphy, P. M. (1994). UCI repository of machine learning databases (Technical Report). Irvine, CA: University of California.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. Proc. 14th International Conference on Machine Learning. Morgan Kaufmann.

Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons Inc.


More information

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter Midwest Big Data Summer Schl: Machine Learning I: Intrductin Kris De Brabanter kbrabant@iastate.edu Iwa State University Department f Statistics Department f Cmputer Science June 24, 2016 1/24 Outline

More information

Churn Prediction using Dynamic RFM-Augmented node2vec

Churn Prediction using Dynamic RFM-Augmented node2vec Churn Predictin using Dynamic RFM-Augmented nde2vec Sandra Mitrvić, Jchen de Weerdt, Bart Baesens & Wilfried Lemahieu Department f Decisin Sciences and Infrmatin Management, KU Leuven 18 September 2017,

More information

Phys. 344 Ch 7 Lecture 8 Fri., April. 10 th,

Phys. 344 Ch 7 Lecture 8 Fri., April. 10 th, Phys. 344 Ch 7 Lecture 8 Fri., April. 0 th, 009 Fri. 4/0 8. Ising Mdel f Ferrmagnets HW30 66, 74 Mn. 4/3 Review Sat. 4/8 3pm Exam 3 HW Mnday: Review fr est 3. See n-line practice test lecture-prep is t

More information

Performance Bounds for Detect and Avoid Signal Sensing

Performance Bounds for Detect and Avoid Signal Sensing Perfrmance unds fr Detect and Avid Signal Sensing Sam Reisenfeld Real-ime Infrmatin etwrks, University f echnlgy, Sydney, radway, SW 007, Australia samr@uts.edu.au Abstract Detect and Avid (DAA) is a Cgnitive

More information

Lyapunov Stability Stability of Equilibrium Points

Lyapunov Stability Stability of Equilibrium Points Lyapunv Stability Stability f Equilibrium Pints 1. Stability f Equilibrium Pints - Definitins In this sectin we cnsider n-th rder nnlinear time varying cntinuus time (C) systems f the frm x = f ( t, x),

More information

Floating Point Method for Solving Transportation. Problems with Additional Constraints

Floating Point Method for Solving Transportation. Problems with Additional Constraints Internatinal Mathematical Frum, Vl. 6, 20, n. 40, 983-992 Flating Pint Methd fr Slving Transprtatin Prblems with Additinal Cnstraints P. Pandian and D. Anuradha Department f Mathematics, Schl f Advanced

More information

Lecture 3: Principal Components Analysis (PCA)

Lecture 3: Principal Components Analysis (PCA) Lecture 3: Principal Cmpnents Analysis (PCA) Reading: Sectins 6.3.1, 10.1, 10.2, 10.4 STATS 202: Data mining and analysis Jnathan Taylr, 9/28 Slide credits: Sergi Bacallad 1 / 24 The bias variance decmpsitin

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

Localized Model Selection for Regression

Localized Model Selection for Regression Lcalized Mdel Selectin fr Regressin Yuhng Yang Schl f Statistics University f Minnesta Church Street S.E. Minneaplis, MN 5555 May 7, 007 Abstract Research n mdel/prcedure selectin has fcused n selecting

More information

Lab 1 The Scientific Method

Lab 1 The Scientific Method INTRODUCTION The fllwing labratry exercise is designed t give yu, the student, an pprtunity t explre unknwn systems, r universes, and hypthesize pssible rules which may gvern the behavir within them. Scientific

More information

Aerodynamic Separability in Tip Speed Ratio and Separability in Wind Speed- a Comparison

Aerodynamic Separability in Tip Speed Ratio and Separability in Wind Speed- a Comparison Jurnal f Physics: Cnference Series OPEN ACCESS Aerdynamic Separability in Tip Speed Rati and Separability in Wind Speed- a Cmparisn T cite this article: M L Gala Sants et al 14 J. Phys.: Cnf. Ser. 555

More information

Optimization Programming Problems For Control And Management Of Bacterial Disease With Two Stage Growth/Spread Among Plants

Optimization Programming Problems For Control And Management Of Bacterial Disease With Two Stage Growth/Spread Among Plants Internatinal Jurnal f Engineering Science Inventin ISSN (Online): 9 67, ISSN (Print): 9 676 www.ijesi.rg Vlume 5 Issue 8 ugust 06 PP.0-07 Optimizatin Prgramming Prblems Fr Cntrl nd Management Of Bacterial

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

BASD HIGH SCHOOL FORMAL LAB REPORT

BASD HIGH SCHOOL FORMAL LAB REPORT BASD HIGH SCHOOL FORMAL LAB REPORT *WARNING: After an explanatin f what t include in each sectin, there is an example f hw the sectin might lk using a sample experiment Keep in mind, the sample lab used

More information

How do scientists measure trees? What is DBH?

How do scientists measure trees? What is DBH? Hw d scientists measure trees? What is DBH? Purpse Students develp an understanding f tree size and hw scientists measure trees. Students bserve and measure tree ckies and explre the relatinship between

More information

Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons

Slide04 (supplemental) Haykin Chapter 4 (both 2nd and 3rd ed): Multi-Layer Perceptrons Slide04 supplemental) Haykin Chapter 4 bth 2nd and 3rd ed): Multi-Layer Perceptrns CPSC 636-600 Instructr: Ynsuck Che Heuristic fr Making Backprp Perfrm Better 1. Sequential vs. batch update: fr large

More information

Video Encoder Control

Video Encoder Control Vide Encder Cntrl Thmas Wiegand Digital Image Cmmunicatin 1 / 41 Outline Intrductin Encder Cntrl using Lagrange multipliers Lagrangian ptimizatin Lagrangian bit allcatin Lagrangian Optimizatin in Hybrid

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

V. Balakrishnan and S. Boyd. (To Appear in Systems and Control Letters, 1992) Abstract

V. Balakrishnan and S. Boyd. (To Appear in Systems and Control Letters, 1992) Abstract On Cmputing the WrstCase Peak Gain f Linear Systems V Balakrishnan and S Byd (T Appear in Systems and Cntrl Letters, 99) Abstract Based n the bunds due t Dyle and Byd, we present simple upper and lwer

More information

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax .7.4: Direct frequency dmain circuit analysis Revisin: August 9, 00 5 E Main Suite D Pullman, WA 9963 (509) 334 6306 ice and Fax Overview n chapter.7., we determined the steadystate respnse f electrical

More information

Application of ILIUM to the estimation of the T eff [Fe/H] pair from BP/RP

Application of ILIUM to the estimation of the T eff [Fe/H] pair from BP/RP Applicatin f ILIUM t the estimatin f the T eff [Fe/H] pair frm BP/RP prepared by: apprved by: reference: issue: 1 revisin: 1 date: 2009-02-10 status: Issued Cryn A.L. Bailer-Jnes Max Planck Institute fr

More information

SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES

SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES 1 SURVIVAL ANALYSIS WITH SUPPORT VECTOR MACHINES Wlfgang HÄRDLE Ruslan MORO Center fr Applied Statistics and Ecnmics (CASE), Humbldt-Universität zu Berlin Mtivatin 2 Applicatins in Medicine estimatin f

More information

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y= Intrductin t Vectrs I 21 Intrductin t Vectrs I 22 I. Determine the hrizntal and vertical cmpnents f the resultant vectr by cunting n the grid. X= y= J. Draw a mangle with hrizntal and vertical cmpnents

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

Document for ENES5 meeting

Document for ENES5 meeting HARMONISATION OF EXPOSURE SCENARIO SHORT TITLES Dcument fr ENES5 meeting Paper jintly prepared by ECHA Cefic DUCC ESCOM ES Shrt Titles Grup 13 Nvember 2013 OBJECTIVES FOR ENES5 The bjective f this dcument

More information

Thermodynamics and Equilibrium

Thermodynamics and Equilibrium Thermdynamics and Equilibrium Thermdynamics Thermdynamics is the study f the relatinship between heat and ther frms f energy in a chemical r physical prcess. We intrduced the thermdynamic prperty f enthalpy,

More information

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018 Michael Faraday lived in the Lndn area frm 1791 t 1867. He was 29 years ld when Hand Oersted, in 1820, accidentally discvered that electric current creates magnetic field. Thrugh empirical bservatin and

More information

Chapter 3 Kinematics in Two Dimensions; Vectors

Chapter 3 Kinematics in Two Dimensions; Vectors Chapter 3 Kinematics in Tw Dimensins; Vectrs Vectrs and Scalars Additin f Vectrs Graphical Methds (One and Tw- Dimensin) Multiplicatin f a Vectr b a Scalar Subtractin f Vectrs Graphical Methds Adding Vectrs

More information

Broadcast Program Generation for Unordered Queries with Data Replication

Broadcast Program Generation for Unordered Queries with Data Replication Bradcast Prgram Generatin fr Unrdered Queries with Data Replicatin Jiun-Lng Huang and Ming-Syan Chen Department f Electrical Engineering Natinal Taiwan University Taipei, Taiwan, ROC E-mail: jlhuang@arbr.ee.ntu.edu.tw,

More information

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents WRITING THE REPORT Organizing the reprt Mst reprts shuld be rganized in the fllwing manner. Smetime there is a valid reasn t include extra chapters in within the bdy f the reprt. 1. Title page 2. Executive

More information

A New Approach to Increase Parallelism for Dependent Loops

A New Approach to Increase Parallelism for Dependent Loops A New Apprach t Increase Parallelism fr Dependent Lps Yeng-Sheng Chen, Department f Electrical Engineering, Natinal Taiwan University, Taipei 06, Taiwan, Tsang-Ming Jiang, Arctic Regin Supercmputing Center,

More information