
The University of Auckland, School of Engineering

SCHOOL OF ENGINEERING REPORT 616

SUPPORT VECTOR MACHINES BASICS

written by Vojislav Kecman
School of Engineering, The University of Auckland
April, 2004

Copyright, Vojislav Kecman

Contents

Abstract
1 Basics of learning from data
2 Support Vector Machines in Classification and Regression
  2.1 Linear Maximal Margin Classifier for Linearly Separable Data
  2.2 Linear Soft Margin Classifier for Overlapping Classes
  2.3 The Nonlinear Classifier
  2.4 Regression by Support Vector Machines
3 Implementation Issues
Appendix: L2 Support Vector Machines Models Derivation (L2 Soft Margin Classifier, L2 Soft Regressor)
References


Support Vector Machines Basics: An Introduction Only

Vojislav Kecman
The University of Auckland, Auckland, New Zealand

Abstract. This is a booklet about (machine) learning from empirical data (i.e., examples, samples, measurements, records, patterns or observations) by applying support vector machines (SVMs), a.k.a. kernel machines. The basic aim of this report is to give, as far as possible, a condensed (but systematic) presentation of a novel learning paradigm embodied in SVMs. Our focus will be on the constructive learning algorithms for both the classification (pattern recognition) and regression (function approximation) problems. Consequently, we will not go into all the subtleties and details of the statistical learning theory (SLT) and structural risk minimization (SRM), which are the theoretical foundations for the learning algorithms presented below. This seems more appropriate for application-oriented readers. The theoretically minded and interested reader may find an extensive presentation of both the SLT and SRM in (Vapnik, 1995, 1998; Cherkassky and Mulier, 1998; Cristianini and Shawe-Taylor, 2000; Kecman, 2001; Schölkopf and Smola, 2002). Instead of diving into the theory, a quadratic programming based learning leading to parsimonious SVMs will be presented in a gentle way - starting with linearly separable problems, through the classification tasks having overlapped classes but still a linear separation boundary, beyond the linearity assumptions to the nonlinear separation boundary, and finally to the linear and nonlinear regression problems. Here, the adjective 'parsimonious' denotes an SVM with a small number of support vectors ('hidden layer neurons'). The scarcity of the model results from a sophisticated, QP based, learning that matches the model capacity to the data complexity, ensuring a good generalization, i.e., a good performance of the SVM on future data unseen during training. Like neural networks, SVMs possess the well-known ability of being universal approximators of any multivariate function to any desired degree of accuracy. Consequently, they are of particular interest for modeling unknown, or partially known, highly nonlinear, complex systems, plants or processes. Also, at the very beginning, and just to be sure what the whole booklet is about, we should state clearly when there is no need for an application of SVMs' model-building techniques. In short, whenever there exists an analytic closed-form model (or it is possible to devise one) there is no need to resort to learning from empirical data by SVMs (or by any other type of learning machine).

1 Basics of learning from data

SVMs have been developed in the reverse order to the development of neural networks (NNs). SVMs evolved from the sound theory to the implementation and experiments, while the NNs followed a more heuristic path, from applications and extensive experimentation to the theory. It is interesting to note that the very strong theoretical background of SVMs did not make them widely appreciated at the beginning. The publication of the first papers by Vapnik, Chervonenkis (Vapnik and Chervonenkis, 1965) and co-workers went largely unnoticed till 1992. This was due to a widespread belief in the statistical and/or machine learning community that, despite being theoretically appealing, SVMs are neither suitable nor relevant for practical applications. They were taken seriously only when excellent results on practical learning benchmarks were achieved (in numeral recognition, computer vision and text categorization). Today, SVMs show better results than (or comparable outcomes to) NNs and other statistical models on the most popular benchmark problems.

The learning problem setting for SVMs is as follows: there is some unknown nonlinear dependency (mapping, function) y = f(x) between some high-dimensional input vector x and a scalar output y (or a vector output y, as in the case of multiclass SVMs). There is no information about the underlying joint probability functions here. Thus, one must perform distribution-free learning. The only information available is a training data set D = {(x_i, y_i) ∈ X × Y}, i = 1, ..., l, where l stands for the number of the training data pairs and is therefore equal to the size of the training data set D. Often, y_i is denoted as d_i, where d stands for a desired (target) value. Hence, SVMs belong to the supervised learning techniques.

Note that this problem is similar to classic statistical inference. However, there are several very important differences between the approaches and assumptions in training SVMs and those in classic statistics and/or NNs modeling. Classic statistical inference is based on the following three fundamental assumptions:

1. Data can be modeled by a set of functions linear in parameters; this is a foundation of the parametric paradigm in learning from experimental data.
2. In most real-life problems, the stochastic component of the data follows the normal probability distribution law, that is, the underlying joint probability distribution is Gaussian.
3. Because of the second assumption, the induction paradigm for parameter estimation is the maximum likelihood method, which is reduced to the minimization of the sum-of-errors-squares cost function in most engineering applications.

All three assumptions on which the classic statistical paradigm relied turned out to be inappropriate for many contemporary real-life problems (Vapnik, 1998) because of the following facts:

1. Modern problems are high-dimensional, and if the underlying mapping is not very smooth, the linear paradigm needs an exponentially increasing number of terms with an increasing dimensionality of the input space X (an increasing number of independent variables). This is known as 'the curse of dimensionality'.
2. The underlying real-life data generation laws may typically be very far from the normal distribution, and a model-builder must consider this difference in order to construct an effective learning algorithm.
3. From the first two points it follows that the maximum likelihood estimator (and consequently the sum-of-error-squares cost function) should be replaced by a new induction paradigm that is uniformly better, in order to model non-Gaussian distributions.

In addition to the three basic facts above, the novel SVMs' problem setting and inductive principle have been developed for standard contemporary data sets which are typically high-dimensional and sparse (meaning, the data sets contain a small number of training data pairs).

SVMs are so-called nonparametric models. 'Nonparametric' does not mean that the SVM models have no parameters at all. On the contrary, their learning (selection, identification, estimation, training or tuning) is the crucial issue here. However, unlike in classic statistical inference, the parameters are not predefined and their number depends on the training data used. In other words, the parameters that define the capacity of the model are data-driven in such a way as to match the model capacity to the data complexity. This is the basic paradigm of the structural risk minimization (SRM), introduced by Vapnik and Chervonenkis and their coworkers, that led to the new learning algorithm. Namely, there are two basic constructive approaches possible in designing a model that will have a good generalization property (Vapnik, 1995 and 1998):

1. choose an appropriate structure of the model (order of polynomials, number of HL neurons, number of rules in the fuzzy logic model) and, keeping the estimation error (a.k.a. confidence interval, a.k.a. variance of the model) fixed in this way, minimize the training error (i.e., empirical risk), or
2. keep the value of the training error (a.k.a. approximation error, a.k.a. empirical risk) fixed (equal to zero or to some acceptable level) and minimize the confidence interval.

Classic NNs implement the first approach (or some of its sophisticated variants) and SVMs implement the second strategy. In both cases the resulting model should resolve the trade-off between under-fitting and over-fitting the training data. The final model structure (its order) should ideally match the learning machine's capacity with the training data complexity. This important difference between the two learning approaches comes from the minimization of different cost (error, loss) functionals. Table 1 tabulates the basic risk functionals applied in developing the three contemporary statistical models.

Table 1: Basic Models and Their Error (Risk) Functionals

  Multilayer perceptron (NN):              R = Σ_{i=1}^{l} (d_i − f(x_i, w))²                 [closeness to data]
  Regularization Network
  (Radial Basis Functions Network):        R = Σ_{i=1}^{l} (d_i − f(x_i, w))² + λ ||Pf||²     [closeness to data + smoothness]
  Support Vector Machine:                  R = Σ_{i=1}^{l} L_ε + Ω(l, h)                      [closeness to data + capacity of a machine]

Closeness to data = training error, a.k.a. empirical risk. Here d stands for desired values, w is the weight vector subject to training, λ is a regularization parameter, P is a smoothness operator, L_ε is an SVMs' loss function, h is a VC dimension and Ω is a function bounding the capacity of the learning machine. In classification problems L_ε is typically the 0-1 loss function, and in regression problems L_ε is the so-called Vapnik's ε-insensitivity loss (error) function

  L_ε = |y − f(x, w)|_ε = { 0,                    if |y − f(x, w)| ≤ ε
                          { |y − f(x, w)| − ε,    otherwise,                    (1)

where ε is the radius of a tube within which the regression function must lie after the successful learning. (Note that for ε = 0, an interpolation of the training data will be performed.)
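
Equation (1) translates directly into code. The following minimal sketch (assuming only NumPy is available; it is an illustration added here, not part of the original report) computes Vapnik's ε-insensitivity loss for a batch of targets and predictions:

    import numpy as np

    def epsilon_insensitive_loss(y, f, eps):
        # Vapnik's loss of equation (1): zero inside the eps-tube
        # around the prediction, linear outside of it.
        return np.maximum(0.0, np.abs(y - f) - eps)

    y = np.array([1.0, 2.0, 3.0])      # desired values d
    f = np.array([1.05, 2.5, 2.0])     # model outputs f(x, w)
    print(epsilon_insensitive_loss(y, f, eps=0.1))   # [0.  0.4  0.9]
    # For eps = 0 the loss reduces to the absolute error |y - f|,
    # i.e., the learning is pushed toward interpolating the data.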

It is interesting to note that Girosi (1997) has shown that under some constraints the SV machine can also be derived from the framework of regularization theory rather than SLT and SRM. Thus, unlike the classic adaptation algorithms (which work in the L2 norm), SV machines represent novel learning techniques which perform SRM. In this way, the SV machine creates a model with a minimized VC dimension, and when the VC dimension of the model is low, the expected probability of error is low as well. This means good performance on previously unseen data, i.e., a good generalization. This property is of particular interest because a good model is one that generalizes well, not one that merely performs well on the training data pairs. Too good a performance on training data is also known as an extremely undesirable overfitting.

As will be shown below, in the simplest pattern recognition tasks, support vector machines use a linear separating hyperplane to create a classifier with a maximal margin. In order to do that, the learning problem for the SV machine will be cast as a constrained nonlinear optimization problem. In this setting the cost function will be quadratic and the constraints linear (i.e., one will have to solve a classic quadratic programming problem).

In cases when the given classes cannot be linearly separated in the original input space, the SV machine first (non-linearly) transforms the original input space into a higher dimensional feature space. This transformation can be achieved by using various nonlinear mappings: polynomial, sigmoidal as in multilayer perceptrons, RBF mappings having as the basis functions radially symmetric functions such as Gaussians, multiquadrics, or different spline functions. After this nonlinear transformation step, the task of an SV machine in finding the linear optimal separating hyperplane in this feature space is relatively trivial. Namely, the optimization problem to solve in a feature space will be of the same kind as the calculation of a maximal margin separating hyperplane in the original input space for linearly separable classes. How, after the specific nonlinear transformation, nonlinearly separable problems in input space can become linearly separable problems in a feature space will be shown later.

In a probabilistic setting, there are three basic components in learning from data tasks: a generator of random inputs x, a system whose training responses y (i.e., d) are used for training the learning machine, and a learning machine which, by using inputs x and the system's responses y, should learn (estimate, model) the unknown dependency between these two sets of variables, defined by the weight vector w (Fig 1).

Figure 1 A model of a learning machine w = w(x, y) that during the training phase (by observing inputs x to, and outputs y from, the system, i.e., plant) estimates (learns, adjusts, trains, tunes) its parameters (weights) w, and in this way learns the mapping y = f(x, w) performed by the system. The connection between the system and the learning machine is present only during the learning phase. The notation o = f_a(x, w) ~ y denotes that we will rarely try to interpolate training data pairs; we would rather seek an approximating function that can generalize well. After the training, at the generalization or test phase, the output from the machine o = f_a(x, w) is expected to be a good estimate of the system's true response y.

The figure shows the most common learning setting that some readers may have already seen in various other fields, notably in statistics, NNs, control system identification and/or

in signal processing. During the (successful) training phase a learning machine should be able to find the relationship between an input space X and an output space Y, by using data D in regression tasks (or to find a function that separates data within the input space, in classification ones). The result of a learning process is an approximating function f_a(x, w), which in statistical literature is also known as a hypothesis f_a(x, w). This function approximates the underlying (or true) dependency between the input and output in the case of regression, and the decision boundary, i.e., separation function, in a classification. The chosen hypothesis f_a(x, w) belongs to a hypothesis space of functions H (f_a ∈ H), and it is a function that minimizes some risk functional R(w).

It may be practical to remind the reader that under the general name 'approximating function' we understand any mathematical structure that maps inputs x into outputs y. Hence, an 'approximating function' may be: a multilayer perceptron NN, an RBF network, an SV machine, a fuzzy model, a Fourier truncated series or a polynomial approximating function. Here we discuss SVMs. A set of parameters w is the very subject of learning, and generally these parameters are called weights. These parameters may have different geometrical and/or physical meanings. Depending upon the hypothesis space of functions H we are working with, the parameters w are usually:

- the hidden and the output layer weights in multilayer perceptrons,
- the rules and the parameters (for the positions and shapes) of fuzzy subsets,
- the coefficients of a polynomial or Fourier series,
- the centers and (co)variances of Gaussian basis functions as well as the output layer weights of this RBF network,
- the support vector weights in SVMs.

There is another important class of functions in learning from examples tasks. A learning machine tries to capture an unknown target function f_o(x) that is believed to belong to some target space T, or to a class T, that is also called a concept class. Note that we rarely know the target space T and that our learning machine generally does not belong to the same class of functions as the unknown target function f_o(x). Typical examples of target spaces are continuous functions with s continuous derivatives in n variables; Sobolev spaces (comprising square integrable functions in n variables with s square integrable derivatives); band-limited functions; functions with integrable Fourier transforms; Boolean functions, etc. In the following, we will assume that the target space T is a space of differentiable functions. The basic problem we are facing stems from the fact that we know very little about the possible underlying function between the input and the output variables. All we have at our disposal is a training data set of labeled examples drawn by independently sampling an (X × Y) space according to some unknown probability distribution.

The learning-from-data problem is ill-posed. (This will be shown in Figs 2 and 3 for regression and classification examples respectively.) The basic source of the ill-posedness of the problem is the infinite number of possible solutions to the learning problem.

At this point, just for the sake of illustration, it is useful to remember that all functions that interpolate the data points will result in a zero value for the training error (empirical risk), as shown (in the case of regression) in Fig 2. The figure shows a simple example of three out of infinitely many different interpolating functions of training data pairs sampled from a noiseless function y = sin(x).

Figure 2 Three out of infinitely many interpolating functions resulting in a training error equal to 0. However, the thick solid, dashed and dotted lines are bad models of the true function y = sin(x) (thin dashed line).

In Fig 2, each interpolant results in a training error equal to zero, but at the same time, each one is a very bad model of the true underlying dependency between x and y, because all three functions perform very poorly outside the training inputs. In other words, none of these three particular interpolants can generalize well. However, not only interpolating functions can mislead. There are many other approximating functions (learning machines) that will minimize the empirical risk (approximation or training error) but not necessarily the generalization error (true, expected or guaranteed risk). This follows from the fact that a learning machine is trained by using some particular sample of the true underlying function, and consequently it always produces biased approximating functions. These approximants depend necessarily on the specific training data pairs (i.e., the training sample) used.
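
The effect shown in Fig 2 is easy to reproduce numerically. In the sketch below (an illustration under assumed sample positions, not the figure's actual data), a polynomial that interpolates noise-free samples of sin(x) attains zero empirical risk, yet its error away from the training inputs is large:

    import numpy as np

    x_train = np.linspace(0, 2 * np.pi, 8)   # 8 noise-free samples of sin(x)
    y_train = np.sin(x_train)

    # A degree-7 polynomial through 8 points is an interpolant:
    # the training error (empirical risk) is numerically zero ...
    coeffs = np.polyfit(x_train, y_train, deg=7)
    print(np.abs(np.polyval(coeffs, x_train) - y_train).max())

    # ... but outside the training inputs it is a bad model of sin(x).
    x_test = np.linspace(-1.0, 2 * np.pi + 1.0, 200)
    print(np.abs(np.polyval(coeffs, x_test) - np.sin(x_test)).max())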

Fig 3 shows an extremely simple classification example where the classes (represented by the empty training circles and squares) are linearly separable. However, in addition to a linear separation (dashed line), the learning was also performed by using a model of a high capacity (say, one with Gaussian basis functions, or one created by a high order polynomial, over the 2-dimensional input space) that produced a perfect separation boundary (empirical risk equals zero) too. However, such a model is overfitting the data and it will definitely perform very badly on test examples unseen during training. Filled circles and squares in the right-hand graph are all wrongly classified by the nonlinear model. Note that a simple linear separation boundary correctly classifies both the training and the test data.

Figure 3 Overfitting in the case of a linearly separable classification problem. Left: The perfect classification of the training data (empty circles and squares) by both a low order linear model (dashed line) and a high order nonlinear one (solid wiggly curve). Right: Wrong classification of all the test data shown (filled circles and squares) by the high capacity model, but a correct one by the simple linear separation boundary.

A solution to this problem proposed in the framework of the SLT is restricting the hypothesis space H of approximating functions to a set smaller than that of the target function T, while simultaneously controlling the flexibility (complexity) of these approximating functions. This is ensured by the introduction of a novel induction principle, the SRM, and its algorithmic realization through the SV machine. The structural risk minimization principle (Vapnik, 1979) tries to minimize an expected risk (the cost function) R comprising two terms, as given in Table 1 for the SVMs, R = Σ_{i=1}^{l} L_ε + Ω(l, h) = R_emp + Ω(l, h), and it is based on the fact that for the classification learning problem, with a probability of at least 1 − η, the bound

  R(w_n) ≤ Ω(l, h, η) + R_emp(w_n)    (2a)

holds. The first term on the right hand side is named a VC confidence (confidence term or confidence interval), defined as

  Ω(l, h, η) = sqrt{ [ h (ln(2l/h) + 1) − ln(η/4) ] / l }.    (2b)
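
The behavior of the confidence term is easy to tabulate. A minimal sketch of (2b), assuming NumPy (the numbers produced are illustrative only):

    import numpy as np

    def vc_confidence(l, h, eta=0.1):
        # VC confidence term of equation (2b) for l training pairs,
        # VC dimension h and confidence level 1 - eta.
        return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

    # Omega shrinks as the number of data l grows, and grows with the
    # capacity h, which are exactly the trends displayed in Fig 4.
    for l in (100, 1000, 10000):
        print(l, vc_confidence(l, h=10), vc_confidence(l, h=50))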

The parameter h is called the VC (Vapnik-Chervonenkis) dimension of a set of functions. It describes the capacity of a set of functions implemented in a learning machine. For a binary classification, h is the maximal number of points which can be separated (shattered) into two classes in all possible 2^h ways by using the functions of the learning machine.

An SV (learning) machine can be thought of as
  o a set of functions implemented in an SVM,
  o an induction principle and,
  o an algorithmic procedure for implementing the induction principle on the given set of functions.

The notation for risks given above, using R(w_n), denotes that an expected risk is calculated over a set of functions f_an(x, w_n) of increasing complexity. Different bounds can also be formulated in terms of other concepts, such as the growth function or the annealed VC entropy. Bounds also differ for regression tasks. More details can be found in (Vapnik, 1995, as well as in Cherkassky and Mulier, 1998). However, the general characteristics of the dependence of the confidence interval on the number of training data l and on the VC dimension h are similar and given in Fig 4.

Figure 4 The dependency of the VC confidence interval Ω(h, l, η) (i.e., the estimation error bound) on the number of training data l and the VC dimension h (h < l), for a fixed confidence level 1 − η = 1 − 0.1 = 0.9.

Equations (2) show that when the number of training data increases, i.e., for l → ∞ (with other parameters fixed), an expected (true) risk R(w_n) is very close to the empirical risk R_emp(w_n), because Ω → 0. On the other hand, when the probability 1 − η (also called a confidence level, which should not be confused with the confidence term Ω) approaches 1, the generalization bound grows large, because in the case when η → 0 (meaning that the confidence level 1 − η → 1), the value of Ω → ∞. This has an obvious intuitive interpretation (Cherkassky and Mulier, 1998) in that any learning machine (model, estimates) obtained from a finite number of training data cannot have an arbitrarily high confidence level.

There is always a trade-off between the accuracy provided by bounds and the degree of confidence (in these bounds). Fig 4 also shows that the VC confidence interval increases with an increase of the VC dimension h for a fixed number of training data pairs l.

The SRM is a novel inductive principle for learning from finite training data sets. It proved to be very useful when dealing with small samples. The basic idea of the SRM is to choose, from a large number of candidate learning machines, a model of the right capacity to describe the given training data pairs. As mentioned, this can be done by restricting the hypothesis space H of approximating functions and simultaneously by controlling their flexibility (complexity). Thus, learning machines will be those parameterized models that, by increasing the number of parameters (typically called weights w_i here), form a nested structure in the following sense

  H_1 ⊂ H_2 ⊂ H_3 ⊂ ... ⊂ H_{n−1} ⊂ H_n ⊂ ... ⊂ H    (3)

In such a nested set of functions, every function always contains the previous, less complex, function. Typically, H_n may be: a set of polynomials in one variable of degree n; a fuzzy logic model having n rules; multilayer perceptrons or an RBF network having n HL neurons; an SVM structured over n support vectors. The goal of learning is one of a subset selection that matches training data complexity with approximating model capacity. In other words, a learning algorithm chooses an optimal polynomial degree or an optimal number of HL neurons or an optimal number of FL model rules, for a polynomial model or NN or FL model respectively. For learning machines linear in parameters, this complexity (expressed by the VC dimension) is given by the number of weights, i.e., by the number of free parameters. For approximating models nonlinear in parameters, the calculation of the VC dimension is often not an easy task. Nevertheless, even for these networks, by using simulation experiments, one can find a model of appropriate complexity.

2 Support Vector Machines in Classification and Regression

Below, we focus on the algorithm for implementing the SRM induction principle on the given set of functions. It implements the strategy mentioned previously: it keeps the training error fixed and minimizes the confidence interval. We first consider a simple example of linear decision rules (i.e., the separating functions will be hyperplanes) for binary classification (dichotomization) of linearly separable data. In such a problem, we are able to perfectly classify the data pairs, meaning that the empirical risk can be set to zero. It is the easiest classification problem, and yet an excellent introduction to all the relevant and important ideas underlying the SLT, SRM and SVM.

Our presentation will gradually increase in complexity. It will begin with a Linear Maximal Margin Classifier for Linearly Separable Data, where there is no sample overlapping. Afterwards, we will allow some degree of overlapping of training data pairs. However, we will still try to separate classes by using linear hyperplanes. This will lead to the Linear Soft Margin Classifier for Overlapping Classes. In problems where linear decision hyperplanes are no longer feasible, the mapping of an input space into the so-called feature space (which corresponds to the HL in NN models) will take place, resulting in the Nonlinear Classifier. Finally, in the subsection on Regression by SV Machines we introduce the same approaches and techniques for solving regression (i.e., function approximation) problems.

2.1 Linear Maximal Margin Classifier for Linearly Separable Data

Consider the problem of binary classification, or dichotomization. Training data are given as

  (x_1, y_1), (x_2, y_2), ..., (x_l, y_l),   x ∈ ℝ^n,  y ∈ {+1, −1}.    (4)

For reasons of visualization only, we will consider the case of a two-dimensional input space, i.e., x ∈ ℝ². The data are linearly separable and there are many different hyperplanes that can perform the separation (Fig 5). (Actually, for x ∈ ℝ², the separation is performed by planes w_1x_1 + w_2x_2 + b = o. In other words, the decision boundary, i.e., the separation line in input space, is defined by the equation w_1x_1 + w_2x_2 + b = 0.) How do we find the best one? The difficult part is that all we have at our disposal are sparse training data. Thus, we want to find the optimal separating function without knowing the underlying probability distribution P(x, y). There are many functions that can solve given pattern recognition (or functional approximation) tasks. In such a problem setting, the SLT (developed in the early 1960s by Vapnik and Chervonenkis) shows that it is crucial to restrict the class of functions implemented by a learning machine to one with a complexity that is suitable for the amount of available training data.

In the case of a classification of linearly separable data, this idea is transformed into the following approach: among all the hyperplanes that minimize the training error (i.e., empirical risk), find the one with the largest margin. This is an intuitively acceptable approach. Just by looking at Fig 5 we will find that the dashed separation line shown in the right graph seems to promise a probably good classification when facing previously unseen data (meaning, in the generalization, i.e., test, phase). Or, at least, it seems probably better in generalization than the dashed decision boundary having the smaller margin shown in the left graph. This can also be expressed as: a classifier with a smaller margin will have a higher expected risk.

By using the given training examples, during the learning stage, our machine finds parameters w = [w_1 w_2 ... w_n]^T and b of a discriminant or decision function d(x, w, b) given as

  d(x, w, b) = w^T x + b = Σ_{i=1}^{n} w_i x_i + b,    (5)

where x, w ∈ ℝ^n, and the scalar b is called a bias. (Note that the dashed separation lines in Fig 5 represent the lines that follow from d(x, w, b) = 0.) After the successful training stage, by using the weights obtained, the learning machine, given a previously unseen pattern x_p, produces output o according to an indicator function given as

  i_F = o = sign(d(x_p, w, b)),    (6)

where o is the standard notation for the output from the learning machine. In other words, the decision rule is: if d(x_p, w, b) > 0, the pattern x_p belongs to class 1 (i.e., o = y_1 = +1), and if d(x_p, w, b) < 0 the pattern x_p belongs to class 2 (i.e., o = y_2 = −1).

Figure 5 Two out of many separating lines: a good one with a large margin M (right) and a less acceptable separating line with a small margin M (left). The solid line in each graph is the separation line, i.e., the decision boundary, between class 1 and class 2.
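
In code, (5) and (6) amount to one dot product and one sign. A minimal sketch, assuming NumPy and hypothetical parameter values (none of the numbers below come from the report):

    import numpy as np

    def d(x, w, b):
        # decision (discriminant) function of equation (5)
        return np.dot(w, x) + b

    def i_F(x, w, b):
        # indicator function (6): the machine's output o = sign(d)
        return np.sign(d(x, w, b))

    w, b = np.array([1.0, -2.0]), 0.5       # hypothetical weights and bias
    x_p = np.array([2.0, 0.5])              # previously unseen pattern
    print(d(x_p, w, b), i_F(x_p, w, b))     # 1.5  1.0, i.e., class 1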

The indicator function i_F given by (6) is a step-wise (i.e., a stairs-wise) function (see Figs 6 and 7). At the same time, the decision (or discriminant) function d(x, w, b) is a hyperplane. Note also that both the decision hyperplane d and the indicator function i_F live in an (n + 1)-dimensional space, or they lie 'over' the training patterns' n-dimensional input space. There is one more mathematical object in classification problems, called a separation boundary, that lives in the same n-dimensional space of input vectors x. The separation boundary separates vectors x into two classes. Here, in cases of linearly separable data, the boundary is also a (separating) hyperplane, but of a lower order than d(x, w, b). The decision (separation) boundary is the intersection of the decision function d(x, w, b) and the space of input features. It is given by

  d(x, w, b) = 0.    (7)

All these functions and relationships can be followed, for two-dimensional inputs x, in Fig 6. In this particular case, the decision boundary, i.e., the separating (hyper)plane, is actually a separating line in an x_1-x_2 plane, and the decision function d(x, w, b) is a plane over the 2-dimensional space of features, i.e., over the x_1-x_2 plane.

Figure 6 The definition of a decision (discriminant) function or hyperplane d(x, w, b), of a decision (separating) boundary d(x, w, b) = 0, and of an indicator function i_F = sign(d(x, w, b)) whose value represents the learning, or SV, machine's output o. The vertical axis shows the desired value y and d(x, w, b); the horizontal axes are the inputs x_1 and x_2. The separation boundary is the intersection of d(x, w, b) with the input plane (x_1, x_2); thus it is w^T x + b = 0. Support vectors are the starred data. The decision function (optimal canonical separating hyperplane) d(x, w, b), with margin M, is the argument of the indicator function.

In the case of 1-dimensional training patterns x (i.e., for 1-dimensional inputs x to the learning machine), the decision function d(x, w, b) is a straight line in an x-y plane. The intersection of this line with the x-axis defines the point that is the separation boundary between the two classes. This can be followed in Fig 7.

Before attempting to find an optimal separating hyperplane having the largest margin, we introduce the concept of the canonical hyperplane. We depict this concept with the help of the 1-dimensional example shown in Fig 7. Not quite incidentally, the decision plane d(x, w, b) shown in Fig 6 is also a canonical plane. Namely, the values of d and of i_F are the same, and both are equal to ±1, for the support vectors depicted by stars. At the same time, for all other training patterns |d| > |i_F|. In order to present the notion of this new concept of the canonical plane, first note that there are many hyperplanes that can correctly separate the data. In Fig 7 three different decision functions d(x, w, b) are shown. There are infinitely many more. In fact, given d(x, w, b), all functions d(x, kw, kb), where k is a positive scalar, are correct decision functions too. Because parameters (w, b) describe the same separation hyperplane as parameters (kw, kb), there is a need to introduce the notion of a canonical hyperplane:

A hyperplane is in the canonical form with respect to training data x ∈ X if

  min_{x_i ∈ X} |w^T x_i + b| = 1.    (8)

The solid line d(x, w, b) = −2x + 5 in Fig 7 fulfills (8) because its minimal absolute value for the given training patterns belonging to the two classes is 1. It achieves this value for two patterns, chosen as support vectors, namely for x_3 = 2 and x_4 = 3. For all other patterns, |d| > 1.

Figure 7 SV classification for 1-dimensional inputs by the linear decision function. Graphical presentation of a canonical hyperplane. For 1-dimensional inputs (feature x, target y, i.e., d), it is actually a canonical straight line (depicted as a thick straight solid line) that passes through the points (+2, +1) and (+3, −1), defined as the support vectors (stars). The two dashed lines, d(x, k_1w, k_1b) and d(x, k_2w, k_2b), represent decision functions that are not canonical hyperplanes; however, they do have the same separation boundary as the canonical hyperplane here. For a 1-dimensional input the decision boundary is a point, or a zero-order hyperplane. The indicator function i_F = sign(d(x, w, b)) is a step-wise function, and it is the SV machine's output o. The training input patterns {x_1 = 0.5, x_2 = 1, x_3 = 2} ∈ Class 1 have a desired or target value (label) y_i = +1. The inputs {x_4 = 3, x_5 = 4, x_6 = 4.5, x_7 = 5} ∈ Class 2 have the label y_i = −1.
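
The canonical form condition (8) can be verified directly for the data of Fig 7. A minimal sketch, assuming NumPy (data and the line d(x) = −2x + 5 are taken from the figure):

    import numpy as np

    x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 4.5, 5.0])   # training inputs
    y = np.array([1, 1, 1, -1, -1, -1, -1])             # labels

    w, b = -2.0, 5.0                  # the canonical line d(x) = -2x + 5
    d = w * x + b
    print(np.abs(d).min())            # 1.0, so condition (8) is fulfilled
    print(x[np.abs(d) == 1.0])        # [2. 3.], the two support vectors
    print((np.sign(d) == y).all())    # True: all patterns separated

    # A scaled line d(x, kw, kb) has the same separation boundary
    # but is no longer canonical:
    k = 2.5
    print(np.abs(k * w * x + k * b).min())    # 2.5, not 1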

Note an interesting detail regarding the notion of a canonical hyperplane that is easily checked. There are many different hyperplanes (planes and straight lines for the 2-D and 1-D problems in Figs 6 and 7 respectively) that have the same separation boundary (the solid line, and a dot, in Figs 6 (right) and 7 respectively). At the same time there are far fewer hyperplanes that can be defined as canonical ones fulfilling (8). In Fig 7, i.e., for a 1-dimensional input vector x, the canonical hyperplane is unique. This is not the case for training patterns of higher dimension. Depending upon the configuration of the class elements, various canonical hyperplanes are possible.

Therefore, there is a need to define an optimal canonical separating hyperplane (OCSH) as a canonical hyperplane having a maximal margin. This search for a separating, maximal margin, canonical hyperplane is the ultimate learning goal in the statistical learning theory underlying SV machines. Carefully note the adjectives used in the previous sentence. The hyperplane obtained from limited training data must have a maximal margin because it will probably better classify new data. It must be in canonical form because this will ease the quest for significant patterns, here called support vectors. The canonical form of the hyperplane will also simplify the calculations. Finally, the resulting hyperplane must ultimately separate the training patterns.

For the sake of brevity, we omit the derivation of the expression for the calculation of the distance (margin M) between the closest members of the two classes. The curious reader can derive the expression for M as given below, or can find it in (Kecman, 2001) or other books. The margin M can be derived by both geometric and algebraic arguments, and is given as

  M = 2 / ||w||.    (9)

This important result will have a great consequence for the constructive (i.e., learning) algorithm in the design of a maximal margin classifier. It will lead to solving a quadratic programming (QP) problem, which will be shown shortly. Hence, the 'good old' gradient learning in NNs will be replaced here by a solution of the QP problem. This is the next important difference between the NNs and SVMs, and it follows from the implementation of SRM in designing SVMs, instead of a minimization of the sum of error squares, which is the standard cost function for NNs.

Equation (9) is a very interesting result showing that the minimization of the norm of the hyperplane normal weight vector ||w|| = sqrt(w^T w) = sqrt(w_1² + w_2² + ... + w_n²) leads to a maximization of the margin M. Because a minimization of sqrt(f) is equivalent to the minimization of f, the minimization of the norm ||w|| equals the minimization of w^T w = Σ_{i=1}^{n} w_i² = w_1² + w_2² + ... + w_n², and this leads to a maximization of the margin M. Hence, the learning problem is

  minimize (1/2) w^T w,    (10a)

subject to the constraints introduced and given in (10b) below.

Figure 8 The optimal canonical separating hyperplane (OCSH) with the largest margin M intersects halfway between the two classes (class 1, y = +1, and class 2, y = −1). The points closest to it (satisfying y_j|w^T x_j + b| = 1, j = 1, ..., N_SV) are support vectors, and the OCSH satisfies y_i(w^T x_i + b) ≥ 1, i = 1, ..., l (where l denotes the number of training data and N_SV stands for the number of SVs). Three support vectors (x_1 and x_2 from class 1, and x_3 from class 2) are the textured training data. (The multiplication of w^T w by 0.5 is for numerical convenience only, and it doesn't change the solution.)

Note that in the case of linearly separable classes the empirical error equals zero (R_emp = 0 in (2a)) and the minimization of w^T w corresponds to the minimization of the confidence term Ω. The OCSH, i.e., the separating hyperplane with the largest margin defined by M = 2/||w||, specifies support vectors, i.e., training data points closest to it, which satisfy y_j[w^T x_j + b] = 1, j = 1, ..., N_SV. For all the other (non-SV) data points the OCSH satisfies the inequalities y_i[w^T x_i + b] > 1. In other words, for all the data, the OCSH should satisfy the following constraints

  y_i[w^T x_i + b] ≥ 1,   i = 1, ..., l,    (10b)

where l denotes the number of training data points, and N_SV stands for the number of SVs. The last equation can easily be checked visually in Figs 6 and 7 for the 2-dimensional and 1-dimensional input vectors x respectively. Thus, in order to find the OCSH having a maximal margin, a learning machine should minimize ||w||² subject to the inequality constraints (10b). This is a classic quadratic optimization problem with inequality constraints. Such

an optimization problem is solved by the saddle point of the Lagrange functional (Lagrangian)

  L(w, b, α) = (1/2) w^T w − Σ_{i=1}^{l} α_i { y_i[w^T x_i + b] − 1 },    (11)

where the α_i are Lagrange multipliers. (In forming the Lagrangian, for constraints of the form f_i > 0, the inequality constraint equations are multiplied by nonnegative Lagrange multipliers (i.e., α_i ≥ 0) and subtracted from the objective function.) The search for an optimal saddle point (w_o, b_o, α_o) is necessary because the Lagrangian L must be minimized with respect to w and b, and maximized with respect to the nonnegative α_i (i.e., α_i ≥ 0 should be found). This problem can be solved either in a primal space (which is the space of parameters w and b) or in a dual space (which is the space of Lagrange multipliers α_i). The second approach gives insightful results and we will consider the solution in a dual space below. In order to do that, we use the Karush-Kuhn-Tucker (KKT) conditions for the optimum of a constrained function. In our case, both the objective function (11) and the constraints (10b) are convex, and the KKT conditions are necessary and sufficient conditions for a maximum of (11). These conditions are: at the saddle point (w_o, b_o, α_o), the derivatives of the Lagrangian L with respect to the primal variables should vanish, which leads to

  ∂L/∂w_o = 0,  i.e.,  w_o = Σ_{i=1}^{l} α_i y_i x_i,    (12)

  ∂L/∂b_o = 0,  i.e.,  Σ_{i=1}^{l} α_i y_i = 0,    (13)

and the KKT complementarity conditions below (stating that at the solution point the products between the dual variables and the constraints equal zero) must also be satisfied,

  α_i { y_i[w^T x_i + b] − 1 } = 0,   i = 1, ..., l.    (14)

Substituting (12) and (13) into the primal variables Lagrangian L(w, b, α) (11), we obtain the dual variables Lagrangian L_d(α)

  L_d(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j x_i^T x_j.    (15)

In order to find the optimal hyperplane, the dual Lagrangian L_d(α) has to be maximized with respect to nonnegative α_i (i.e., α must be in the nonnegative quadrant) and with respect to the equality constraint as follows

  α_i ≥ 0,   i = 1, ..., l,    (16a)

  Σ_{i=1}^{l} α_i y_i = 0.    (16b)

Note that the dual Lagrangian L_d(α) is expressed in terms of training data and depends only on the scalar products of input patterns (x_i^T x_j). The dependency of L_d(α) on a scalar product of inputs will be very handy later when analyzing nonlinear decision boundaries and for general nonlinear regression. Note also that the number of unknown variables equals the number of training data l. After learning, the number of free parameters is equal to the number of SVs, but it does not depend on the dimensionality of the input space. Such a standard quadratic optimization problem can be expressed in matrix notation and formulated as follows:

  maximize    L_d(α) = −0.5 α^T H α + f^T α,    (17a)
  subject to  y^T α = 0,    (17b)
              α_i ≥ 0,   i = 1, ..., l,    (17c)

where α = [α_1, α_2, ..., α_l]^T, H denotes the Hessian matrix (H_ij = y_i y_j (x_i · x_j) = y_i y_j x_i^T x_j) of this problem, and f is an (l, 1) unit vector f = 1 = [1 1 ... 1]^T. (Note that the maximization of (17a) equals the minimization of L_d(α) = 0.5 α^T H α − f^T α, subject to the same constraints.) The solutions α_o of the dual optimization problem above determine the parameters w_o and b_o of the optimal hyperplane according to (12) and (14) as follows

  w_o = Σ_{i=1}^{l} α_oi y_i x_i,    (18a)

  b_o = (1/N_SV) Σ_{s=1}^{N_SV} ( (1/y_s) − x_s^T w_o ) = (1/N_SV) Σ_{s=1}^{N_SV} ( y_s − x_s^T w_o ),   s = 1, ..., N_SV.    (18b)

In deriving (18b) we used the fact that y can be either +1 or −1, and 1/y = y. N_SV denotes the number of support vectors. There are two important observations about the calculation of w_o. First, the optimal weight vector w_o is obtained in (18a) as a linear combination of the training data points, and second, w_o (same as the bias term b_o) is calculated by using only the selected data points called support vectors (SVs). The fact that the summation in (18a) goes over all training data patterns (i.e., from 1 to l) is irrelevant, because the Lagrange multipliers for all non-support vectors equal zero (α_oi = 0, i = N_SV + 1, ..., l). Finally, having calculated w_o and b_o we obtain a decision hyperplane d(x) and an indicator function i_F = o = sign(d(x)) as given below

  d(x) = w_o^T x + b_o = Σ_{i=1}^{l} y_i α_i x_i^T x + b_o,    i_F = o = sign(d(x)).    (19)
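
The dual problem (17) can be handed to any QP solver. The sketch below is an illustration (not the report's code): it uses SciPy's general-purpose SLSQP routine on a hypothetical separable toy set, whereas a production implementation would use a dedicated QP or SMO solver:

    import numpy as np
    from scipy.optimize import minimize

    def hard_margin_svm(X, y):
        # Maximize (17a) s.t. (17b), (17c), i.e. minimize -L_d(alpha).
        l = len(y)
        H = (y[:, None] * y[None, :]) * (X @ X.T)   # H_ij = y_i y_j x_i^T x_j
        fun = lambda a: 0.5 * a @ H @ a - a.sum()
        jac = lambda a: H @ a - np.ones(l)
        res = minimize(fun, np.zeros(l), jac=jac, method='SLSQP',
                       bounds=[(0, None)] * l,
                       constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
        a = res.x
        w = (a * y) @ X                              # (18a)
        sv = a > 1e-6                                # nonzero alphas mark SVs
        b = np.mean(y[sv] - X[sv] @ w)               # (18b), over SVs only
        return a, w, b

    X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    alpha, w, b = hard_margin_svm(X, y)
    print(alpha.round(3), w, b)
    print('margin M =', 2 / np.linalg.norm(w))       # equation (9)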

Training data patterns having non-zero Lagrange multipliers are called support vectors. For linearly separable training data, all support vectors lie on the margin and they are generally just a small portion of all training data (typically, N_SV << l). Figs 6, 7 and 8 show the geometry of standard results for non-overlapping classes.

Before presenting applications of the OCSH for both overlapping classes and classes having nonlinear decision boundaries, we will comment only on whether and how SV-based linear classifiers actually implement the SRM principle. A more detailed presentation of this important property can be found in (Kecman, 2001; Schölkopf and Smola, 2002). First, it can be shown that an increase in margin reduces the number of points that can be shattered, i.e., the increase in margin reduces the VC dimension, and this leads to a decrease of the SVM capacity. In short, by minimizing ||w|| (i.e., maximizing the margin) the SV machine training actually minimizes the VC dimension and consequently the generalization error (expected risk) at the same time. This is achieved by imposing a structure on the set of canonical hyperplanes and then, during the training, by choosing the one with a minimal VC dimension. The structure on the set of canonical hyperplanes is introduced by considering various hyperplanes having different ||w||. In other words, we analyze the sets S_A such that ||w|| ≤ A. Then, if A_1 ≤ A_2 ≤ A_3 ≤ ... ≤ A_n, we introduce a nested set S_A1 ⊂ S_A2 ⊂ S_A3 ⊂ ... ⊂ S_An. Thus, if we impose the constraint ||w|| ≤ A, then the canonical hyperplane cannot be closer than 1/A to any of the training points x_i. Vapnik (1995) states that the VC dimension h of a set of canonical hyperplanes in ℝ^n such that ||w|| ≤ A is

  h ≤ min[R²A², n] + 1,    (20)

where all the training data points (vectors) are enclosed by a sphere of the smallest radius R. Therefore, a small ||w|| results in a small h, and the minimization of ||w|| is an implementation of the SRM principle. In other words, a minimization of the canonical hyperplane weight norm ||w|| minimizes the VC dimension according to (20). See also Fig 4, which shows how the estimation error, meaning the expected risk (because the empirical risk, due to the linear separability, equals zero), decreases with a decrease of the VC dimension.

Finally, there is an interesting, simple and powerful result (Vapnik, 1995) connecting the generalization ability of learning machines and the number of support vectors. Once the support vectors have been found, we can calculate the bound on the expected probability of committing an error on a test example as follows

  E_l[P(error)] ≤ E[number of support vectors] / l,    (21)

where E_l denotes expectation over all training data sets of size l. Note how easy it is to estimate this bound, which is independent of the dimensionality of the input space. Therefore, an SV machine having a small number of support vectors will have good generalization ability, even in a very high-dimensional space.

The example below shows the SVM's learning of the weights for a simple separable data problem, in both the primal and the dual domain. The small number and low dimensionality of the data pairs are used in order to show the optimization steps analytically and graphically. The same reasoning holds in the case of high-dimensional and large training data sets, but for those one has to rely on computers, and the insight into the solution steps is necessarily lost.

Example: Consider the design of an SVM classifier for the 3 data shown in Fig 9 below: the inputs {x_1 = −2, x_2 = −1} ∈ Class 1 with the label y = +1, and {x_3 = 0} ∈ Class 2 with the label y = −1.

Figure 9 Left: Solving the SVM classifier for the 3 data shown (target y, i.e., d, versus feature x); SVs are the starred data. Right: The solution space (w, b), with the three lines (a), (b) and (c) corresponding to the constraint equalities below; the textured area is the feasible domain.

First we solve the problem in the primal domain. From the constraints (10b) it follows

  −2w + b ≥ 1,    (a)
  −w + b ≥ 1,    (b)
  −b ≥ 1.    (c)

The three straight lines corresponding to the equalities above are shown in Fig 9 right. The textured area is the feasible domain for the weight w and the bias b. Note that this area is not bounded by the inequality (a), thus pointing to the fact that the point x_1 = −2 is not a support vector. The points −1 and 0 define the textured area, and they will be the supporting data for our decision function. The task is to minimize (10a), and this will be achieved by taking the value w = −2. Then, from (b), it follows that b = −1. Note that (a) must not be used for the calculation of the bias term b.

Because both the cost function (10a) and the constraints (10b) are convex, the primal and the dual solution must produce the same w and b. The dual solution follows from maximizing (15) subject to (16) as follows

  maximize  L_d(α) = α_1 + α_2 + α_3 − (1/2) α^T H α,   with  H = [ 4  2  0
                                                                    2  1  0
                                                                    0  0  0 ],
  s.t.  α_1 + α_2 − α_3 = 0,   α_1 ≥ 0,  α_2 ≥ 0,  α_3 ≥ 0.

The dual Lagrangian is obtained in terms of α_1 and α_2 after expressing α_3 from the equality constraint, and it is given as L_d(α) = 2α_1 + 2α_2 − 0.5(4α_1² + 4α_1α_2 + α_2²). L_d will have a maximum for α_1 = 0, and it follows that we have to find the maximum of L_d(α_2) = 2α_2 − 0.5α_2², which is attained at α_2 = 2. Note that the Hessian matrix is extremely badly conditioned, and if the QP problem is to be solved by computer, H should be regularized first. From the equality constraint it follows that α_3 = 2 too. Now, we can calculate the weight vector w and the bias b from (18a) and (18b) as follows

  w = Σ_{i=1}^{3} α_i y_i x_i = 0(1)(−2) + 2(1)(−1) + 2(−1)(0) = −2.

The bias can be calculated by using the SVs only, meaning from either the point −1 or the point 0. Both result in the same value, as shown below

  b = 1 − (−2)(−1) = −1,   or   b = −1 − (−2)(0) = −1.
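
The analytical solution of the example is easy to verify numerically. A minimal sketch (assuming NumPy) checking equations (13), (18a), (18b), the constraints (10b) and the KKT conditions (14) for the solution found above:

    import numpy as np

    x = np.array([-2.0, -1.0, 0.0])
    y = np.array([1.0, 1.0, -1.0])
    alpha = np.array([0.0, 2.0, 2.0])      # dual solution obtained above

    print(alpha @ y)                        # equality constraint (13): 0
    w = np.sum(alpha * y * x)               # (18a): w = -2
    sv = alpha > 0                          # SVs are the points -1 and 0
    b = np.mean(1 / y[sv] - w * x[sv])      # (18b): b = -1
    d = w * x + b                           # decision values [3, 1, -1]
    print(w, b, d)
    print(y * d >= 1)                       # constraints (10b) all hold
    print(alpha * (y * d - 1))              # KKT conditions (14): all zero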

2.2 Linear Soft Margin Classifier for Overlapping Classes

The learning procedure presented above is valid for linearly separable data, meaning for training data sets without overlapping. Such problems are rare in practice. At the same time, there are many instances when linear separating hyperplanes can be good solutions even when the data are overlapped (e.g., normally distributed classes having the same covariance matrices have a linear separation boundary). However, quadratic programming solutions as given above cannot be used in the case of overlapping, because the constraints y_i[w^T x_i + b] ≥ 1, i = 1, ..., l given by (10b) cannot be satisfied. In the case of an overlapping (see Fig 10), the overlapped data points cannot be correctly classified, and for any misclassified training data point x_i, the corresponding α_i will tend to infinity. This particular data point (by increasing the corresponding α_i value) attempts to exert a stronger influence on the decision boundary in order to be classified correctly. When the α_i value reaches the maximal bound, it can no longer increase its effect, and the corresponding point will stay misclassified. In such a situation, the algorithm introduced above chooses (almost) all training data points as support vectors. To find a classifier with a maximal margin, the algorithm presented in section 2.1 above must be changed, allowing some data to be unclassified. Better to say, we must leave some data on the 'wrong' side of the decision boundary. In practice, we allow a soft margin, and all data inside this margin (whether on the correct side of the separating line or on the wrong one) are neglected. The width of the soft margin can be controlled by a corresponding penalty parameter C (introduced below) that determines the trade-off between the training error and the VC dimension of the model.

The question now is how to measure the degree of misclassification and how to incorporate such a measure into the hard margin learning algorithm given by equations (10). The simplest method would be to form the following learning problem

  minimize (1/2) w^T w + C (number of misclassified data),    (22)

where C is a penalty parameter, trading off the margin size (defined by ||w||, i.e., by w^T w) for the number of misclassified data points. A large C leads to a small number of misclassifications, a bigger w^T w and consequently a smaller margin, and vice versa. Obviously, taking C = ∞ requires that the number of misclassified data be zero and, in the case of an overlapping, this is not possible. Hence, the problem may be feasible only for some value C < ∞.

Figure 10 The soft decision boundary for a dichotomization problem with data overlapping. The separation line d(x) = 0 (solid), the margins d(x) = −1 and d(x) = +1 (dashed), and the support vectors (textured training data points) are shown. For a misclassified positive class point (class 1, y = +1) the slack is ξ = 1 − d(x) > 1, and for a misclassified negative class point (class 2, y = −1) it is ξ = 1 + d(x) > 1. There are 4 SVs in the positive class (circles) and 3 SVs in the negative class (squares), with 2 misclassifications for the positive class and 1 misclassification for the negative class.

However, the serious problem with (22) is that this error counting cannot be accommodated within the handy (meaning reliable, well understood and well developed) quadratic programming approach. Also, mere counting cannot distinguish between huge (or disastrous) errors and close misses! A possible solution is to measure the distances ξ_i of the points crossing the margin from the corresponding margin, and trade their sum for the margin size, as given below

  minimize (1/2) w^T w + C (sum of distances of the wrong side points).    (23)

In fact, this is exactly how the problem of the data overlapping was solved in (Cortes, 1995; Cortes and Vapnik, 1995) - by generalizing the optimal 'hard' margin algorithm. They introduced the nonnegative slack variables ξ_i (i = 1, ..., l) into the statement of the optimization problem for the overlapped data points. Now, instead of fulfilling (10a) and (10b), the separating hyperplane must satisfy

  minimize (1/2) w^T w + C Σ_{i=1}^{l} ξ_i,    (24a)

subject to

  y_i[w^T x_i + b] ≥ 1 − ξ_i,   i = 1, ..., l,   ξ_i ≥ 0,    (24b)

i.e., subject to

  w^T x_i + b ≥ +1 − ξ_i,   for y_i = +1,   ξ_i ≥ 0,    (24c)

  w^T x_i + b ≤ −1 + ξ_i,   for y_i = −1,   ξ_i ≥ 0.    (24d)

Hence, for such a generalized optimal separating hyperplane, the functional to be minimized comprises an extra term accounting for the cost of overlapping errors. In fact, the cost function (24a) can be even more general, as given below

  minimize (1/2) w^T w + C Σ_{i=1}^{l} ξ_i^k,    (24e)

subject to the same constraints. This is a convex programming problem that is usually solved only for k = 1 or k = 2, and such soft margin SVMs are dubbed L1 and L2 SVMs respectively. By choosing the exponent k = 1, neither the slack variables ξ_i nor their Lagrange multipliers β_i appear in the dual Lagrangian L_d. As for the linearly separable problem presented previously, for L1 SVMs (k = 1) here, the solution to the quadratic programming problem (24) is given by the saddle point of the primal Lagrangian L_p(w, b, ξ, α, β) shown below

  L_p(w, b, ξ, α, β) = (1/2) w^T w + C Σ_{i=1}^{l} ξ_i − Σ_{i=1}^{l} α_i { y_i[w^T x_i + b] − 1 + ξ_i } − Σ_{i=1}^{l} β_i ξ_i,   for the L1 SVM,    (25)

where α_i and β_i are the Lagrange multipliers. Again, we should find an optimal saddle point (w_o, b_o, ξ_o, α_o, β_o), because the Lagrangian L_p has to be minimized with respect to w, b and ξ, and maximized with respect to the nonnegative α_i and β_i. As before, this problem can be solved in either a primal space or a dual space (which is the space of Lagrange multipliers α_i and β_i). Again, we consider a solution in a dual space as given below, by using the standard conditions for an optimum of a constrained function

  ∂L/∂w_o = 0,  i.e.,  w_o = Σ_{i=1}^{l} α_i y_i x_i,    (26)

  ∂L/∂b_o = 0,  i.e.,  Σ_{i=1}^{l} α_i y_i = 0,    (27)

  ∂L/∂ξ_io = 0,  i.e.,  α_i + β_i = C,    (28)

and the KKT complementarity conditions below,

  α_i { y_i[w^T x_i + b] − 1 + ξ_i } = 0,   i = 1, ..., l,    (29a)

  β_i ξ_i = (C − α_i) ξ_i = 0,   i = 1, ..., l.    (29b)

At the optimal solution, due to the KKT conditions (29), the last two terms in the primal Lagrangian L_p given by (25) vanish, and the dual variables Lagrangian L_d(α), for the L1 SVM, is not a function of β. In fact, it is the same as the hard margin classifier's L_d given before, and it is repeated here for the soft margin one,

  L_d(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j x_i^T x_j.    (30)

In order to find the optimal hyperplane, the dual Lagrangian L_d(α) has to be maximized with respect to nonnegative α_i which are (unlike before) smaller than or equal to C. In other words, with

  C ≥ α_i ≥ 0,   i = 1, ..., l,    (31a)

and under the constraint (27), i.e., under

  Σ_{i=1}^{l} α_i y_i = 0.    (31b)

Thus, the final quadratic optimization problem is practically the same as for the separable case, the only difference being in the modified bounds of the Lagrange multipliers α_i. The penalty parameter C, which is now the upper bound on α_i, is determined by the user. The selection of a 'good or proper' C is always done experimentally by using some cross-validation technique. Note that in the previous linearly separable case, without data overlapping, this upper bound was C = ∞. We can also readily change to the matrix notation of the problem above, as in equations (17). Most important of all is that the learning problem is expressed only in terms of the unknown Lagrange multipliers α_i and the known inputs and outputs. Furthermore, the optimization does not solely depend upon the inputs x_i, which can be of a very high (even infinite) dimension; it depends upon a scalar product of the input vectors x_i. It is this property that we will use in the next section, where we design SV machines that can create nonlinear separation boundaries. Finally, the expressions for both the decision function d(x) and the indicator function i_F = sign(d(x)) for a soft margin classifier are the same as for linearly separable classes and are also given by (19).

From (29) it follows that there are only three possible solutions for α_i (see Fig 10, and the sketch after this list):

1. α_i = 0, ξ_i = 0: the data point x_i is correctly classified.
2. C > α_i > 0: then, the two complementarity conditions must result in y_i[w^T x_i + b] − 1 + ξ_i = 0 and ξ_i = 0. Thus, y_i[w^T x_i + b] = 1 and x_i is a support vector. The support vectors with C > α_i > 0 are called unbounded or free support vectors. They lie on the two margins.
3. α_i = C: then, y_i[w^T x_i + b] − 1 + ξ_i = 0 and ξ_i ≥ 0, and x_i is a support vector. The support vectors with α_i = C are called bounded support vectors. They lie on the 'wrong' side of the margin. For 1 > ξ_i ≥ 0, x_i is still correctly classified, and if ξ_i ≥ 1, x_i is misclassified.
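
These three cases can be observed on any overlapping data set. A minimal sketch, assuming scikit-learn is available and using randomly generated (hypothetical) overlapping classes:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + [1, 1], rng.randn(20, 2) - [1, 1]])
    y = np.hstack([np.ones(20), -np.ones(20)])

    C = 1.0
    clf = SVC(kernel='linear', C=C).fit(X, y)

    # dual_coef_ stores y_i * alpha_i for the SVs; points with alpha_i = 0
    # (case 1) are simply absent from the support vector set.
    alpha = np.abs(clf.dual_coef_).ravel()
    print('free SVs, 0 < alpha < C: ', np.sum(alpha < C - 1e-8))   # case 2
    print('bounded SVs, alpha = C:  ', np.sum(alpha >= C - 1e-8))  # case 3
    print('margin M = 2/||w|| =', 2 / np.linalg.norm(clf.coef_))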

For the L2 SVM the second term in the cost function (24e) is quadratic, i.e., C Σ_{i=1}^{l} ξ_i², and this leads to changes in the dual optimization problem, which is now

  L_d(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j ( x_i^T x_j + δ_ij / C ),    (32)

subject to

  α_i ≥ 0,   i = 1, ..., l,    (33a)

  Σ_{i=1}^{l} α_i y_i = 0,    (33b)

where δ_ij = 1 for i = j, and is zero otherwise. Note the change in the Hessian matrix elements, given by the second terms in (32), as well as that there is no upper bound on α_i. A detailed analysis and comparison of the L1 and L2 SVMs is presented in (Abe, 2004). The derivation of (32) and (33) is given in the Appendix. We use the most popular L1 SVMs here, because they usually produce more sparse solutions, i.e., they create a decision function by using fewer SVs than the L2 SVMs.

2.3 The Nonlinear Classifier

The linear classifiers presented in the two previous sections are very limited. Mostly, classes are not only overlapped, but the genuine separation functions are nonlinear hypersurfaces. A nice and strong characteristic of the approach presented above is that it can be easily (and in a relatively straightforward manner) extended to create nonlinear decision boundaries. The motivation for such an extension is that an SV machine that can create a nonlinear decision hypersurface will be able to classify nonlinearly separable data. This will be achieved by considering a linear classifier in the so-called feature space, which will be introduced shortly. A very simple example of the need for designing nonlinear models is given in Fig 11, where the true separation boundary is quadratic. It is obvious that no errorless linear separating hyperplane can be found now. The best linear separation function, shown as a dashed straight line, would make six misclassifications (textured data points: 4 in the negative class and 2 in the positive one). Yet, if we use the nonlinear separation boundary, we are able to separate the two classes without any error.

Generally, for n-dimensional input patterns, instead of a nonlinear curve, an SV machine will create a nonlinear separating hypersurface.

Figure 11 A nonlinear SVM without data overlapping. The true separation is a quadratic curve. The nonlinear separation line (solid), the linear one (dashed) and the data points misclassified by the linear separation line (the textured training data points) are shown. There are 4 misclassified negative data and 2 misclassified positive ones. SVs are not shown.

The basic idea in designing nonlinear SV machines is to map input vectors x ∈ ℝ^n into vectors Φ(x) of a higher dimensional feature space F (where Φ represents a mapping ℝ^n → ℝ^f), and to solve a linear classification problem in this feature space

  x ∈ ℝ^n → Φ(x) = [φ_1(x) φ_2(x), ..., φ_f(x)]^T ∈ ℝ^f.    (34)

A mapping Φ(x) is chosen in advance, i.e., it is a fixed function. Note that an input space (x-space) is spanned by the components x_i of an input vector x, and a feature space F (Φ-space) is spanned by the components φ_i(x) of the vector Φ(x). By performing such a mapping, we hope that in a Φ-space our learning algorithm will be able to linearly separate the images of x by applying the linear SVM formulation presented above. (In fact, it can be shown that for a whole class of mappings the linear separation in a feature space is always possible. Such mappings will correspond to the positive definite kernels that will be shown shortly.) We also expect this approach to again lead to solving a quadratic optimization problem with similar constraints in a Φ-space. The solution for an indicator function i_F(x) = sign(w^T Φ(x) + b) = sign( Σ_{i=1}^{l} y_i α_i Φ^T(x_i)Φ(x) + b ), which is a linear classifier in a feature space, will create a nonlinear separating hypersurface in the original input space, given by (35) below.

(Compare this solution with (19) and note the appearance of scalar products in both the original X-space and in the feature space F.) The equation for i_F(x) just given above can be rewritten in a 'neural networks' form as follows

  i_F(x) = sign( Σ_{i=1}^{l} y_i α_i Φ^T(x_i)Φ(x) + b ) = sign( Σ_{i=1}^{l} y_i α_i k(x_i, x) + b ) = sign( Σ_{i=1}^{l} v_i k(x_i, x) + b ),    (35)

where v_i corresponds to the output layer weights of the 'SVM's network' and k(x_i, x) denotes the value of the kernel function that will be introduced shortly. (v_i equals y_i α_i in the classification case presented above, and it is equal to (α_i − α_i*) in the regression problems.) Note the difference between the weight vector w, whose norm should be minimized and which is a vector of the same dimension as the feature space vector Φ(x), and the weightings v_i = α_i y_i, which are scalar values composing the weight vector v, whose dimension equals the number of training data points l. The (l − N_SVs) components of v are equal to zero, and only the N_SVs entries of v are nonzero elements.

A simple example below (Fig 12) should exemplify the idea of a nonlinear mapping to a (usually) higher dimensional space, and how it happens that the data become linearly separable in the F-space.

Figure 12 A nonlinear 1-dimensional classification problem (x_1 = −1, x_2 = 0, x_3 = 1). One possible solution is given by the decision function d(x) (solid curve), i.e., by the corresponding indicator function defined as i_F = sign(d(x)) (dashed stepwise function).

Consider solving the simplest 1-dimensional classification problem given the inputs and the outputs (desired values) as follows: x = [−1 0 1]^T and d = y = [−1 1 −1]^T. Here we choose the following mapping to the feature space: Φ(x) = [φ_1(x) φ_2(x) φ_3(x)]^T = [x² √2x 1]^T. The mapping produces the following three points in the feature space (shown as the rows of the matrix F, F standing for features)

Consider solving the simplest 1-D classification problem given the input and the output (desired) values as follows: x = [-1 0 1]^T and d = y = [-1 1 -1]^T. Here we choose the following mapping to the feature space: Φ(x) = [φ_1(x) φ_2(x) φ_3(x)]^T = [x² √2x 1]^T. The mapping produces the following three points in the feature space (shown as the rows of the matrix F, F standing for features):

   F = [ 1  -√2  1
         0   0   1
         1   √2  1 ]

These three points are linearly separable, by the plane φ_3(x) = 2φ_1(x), in the feature space, as shown in Fig 13. It is easy to show that the mapping obtained by Φ(x) = [x² √2x 1]^T is a scalar product implementation of the quadratic kernel function (x_i x_j + 1)² = k(x_i, x_j). In other words, Φ^T(x_i)Φ(x_j) = k(x_i, x_j). This equality will be introduced shortly.

Figure 13  The three data points of the problem in Fig 12, namely [1 -√2 1]^T, [0 0 1]^T and [1 √2 1]^T, are linearly separable in the feature space (obtained by the mapping Φ(x) = [φ_1(x) φ_2(x) φ_3(x)]^T = [x² √2x 1]^T). The separation boundary is given as the plane φ_3(x) = 2φ_1(x) shown in the figure.
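A few lines (ours) verify the toy example numerically: the scalar products of the mapped points reproduce the quadratic kernel, and the plane φ_3 = 2φ_1 outputs exactly the class labels.

    import numpy as np

    x = np.array([-1.0, 0.0, 1.0])                            # training inputs
    F = np.column_stack([x**2, np.sqrt(2) * x, np.ones(3)])   # rows are Phi(x_i)

    # scalar products in feature space equal the kernel (x_i x_j + 1)^2
    assert np.allclose(F @ F.T, (np.outer(x, x) + 1.0)**2)

    # the plane phi_3(x) = 2 phi_1(x), i.e., w = [-2, 0, 1], separates the points
    print(F @ np.array([-2.0, 0.0, 1.0]))                     # [-1.  1. -1.]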

There are two basic problems when mapping an input x-space into a higher order F-space: (i) the choice of a mapping Φ(x) that should result in a rich class of decision hypersurfaces, and (ii) the calculation of the scalar product Φ^T(x)Φ(x), which can be computationally very discouraging if the number of features f (i.e., the dimensionality f of a feature space) is very large. The second problem is connected with a phenomenon called the 'curse of dimensionality'. For example, to construct a decision surface corresponding to a polynomial of degree two in an n-dimensional input space, a feature space of dimensionality f = n(n + 3)/2 is needed. In other words, a feature space is spanned by f coordinates of the form

   z_1 = x_1, ..., z_n = x_n   (n coordinates),
   z_{n+1} = (x_1)², ..., z_{2n} = (x_n)²   (next n coordinates),
   z_{2n+1} = x_1 x_2, ..., z_f = x_{n-1} x_n   (n(n-1)/2 coordinates),

and the separating hyperplane created in this space is a second-degree polynomial in the input space (Vapnik, 1998). Thus, constructing a polynomial of degree two only, in a 256-dimensional input space, leads to a feature space of dimensionality f = 33,152. Performing a scalar product operation with vectors of such, or higher, dimensions is not a cheap computational task. The problems become serious (and fortunately only seemingly unsolvable) if we want to construct a polynomial of degree 4 or 5 in the same 256-dimensional space, leading to the construction of a decision hyperplane in a billion-dimensional feature space.

This explosion in dimensionality can be avoided by noticing that in the quadratic optimization problems for both the hard and the soft margin classifier above, as well as in the final expression for a classifier, the training data only appear in the form of scalar products x_i^T x_j. These products are replaced by scalar products Φ^T(x_i)Φ(x_j) = [φ_1(x_i), φ_2(x_i), ..., φ_f(x_i)][φ_1(x_j), φ_2(x_j), ..., φ_f(x_j)]^T in a feature space F, and the latter can be and will be expressed by using the kernel function K(x_i, x_j) = Φ^T(x_i)Φ(x_j).

Note that a kernel function K(x_i, x_j) is a function in input space. Thus, the basic advantage of using a kernel function K(x_i, x_j) is that it avoids having to perform the mapping Φ(x) at all. Instead, the required scalar products Φ^T(x_i)Φ(x_j) in a feature space are calculated directly by computing the kernels K(x_i, x_j) for the given training data vectors in an input space. In this way, we bypass the possibly extremely high dimensionality of a feature space F. Moreover, by using a suitably chosen kernel K(x_i, x_j), we can construct an SVM that operates in an infinite dimensional feature space (one such kernel is the Gaussian kernel given in the table below). In addition, as will be shown below, by applying kernels we do not even have to know what the actual mapping Φ(x) is. A kernel is a function K such that

   K(x_i, x_j) = Φ^T(x_i)Φ(x_j).        (36)

There are many possible kernels, and the most popular ones are given in Table 2. All of them should fulfill the so-called Mercer's conditions. The Mercer kernels belong to the set of reproducing kernels. For further details see (Mercer, 1909; Aizerman et al., 1964; Smola and Schölkopf, 1997; Vapnik, 1998; Kecman, 2001). The simplest is the linear kernel, defined as K(x_i, x_j) = x_i^T x_j. Below we show a few more kernels.

POLYNOMIAL KERNELS: Let x ∈ R², i.e., x = [x_1 x_2]^T, and choose Φ(x) = [x_1² √2x_1x_2 x_2²]^T (i.e., there is an R² → R³ mapping). Then the dot product is

   Φ^T(x_i)Φ(x_j) = [x_i1² √2x_i1x_i2 x_i2²][x_j1² √2x_j1x_j2 x_j2²]^T
                  = x_i1²x_j1² + 2x_i1x_i2x_j1x_j2 + x_i2²x_j2²
                  = (x_i^T x_j)² = K(x_i, x_j), or

   K(x_i, x_j) = (x_i^T x_j)² = Φ^T(x_i)Φ(x_j).
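A short sketch (ours) of both points above: the f = n(n + 3)/2 count for n = 256, and the kernel identity, checked numerically for the R² → R³ mapping just shown.

    import numpy as np

    # dimensionality of a degree-2 feature space: f = n(n + 3)/2
    n = 256
    print(n * (n + 3) // 2)             # 33152, as stated in the text

    # the explicit mapping and the kernel give the same scalar product
    phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
    a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(a) @ phi(b), (a @ b)**2)  # both 1.0: Phi'(a)Phi(b) = (a'b)^2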

Note that in order to calculate the scalar product in a feature space, Φ^T(x_i)Φ(x_j), we do not need to perform the mapping Φ(x) = [x_1² √2x_1x_2 x_2²]^T at all. Instead, we calculate this product directly in the input space by computing (x_i^T x_j)². This is very well known under the popular name of the 'kernel trick'. Interestingly, note also that other mappings, such as the R² → R³ mapping given by Φ(x) = (1/√2)[x_1² - x_2²  2x_1x_2  x_1² + x_2²], or the R² → R⁴ mapping given by Φ(x) = [x_1²  x_1x_2  x_1x_2  x_2²], also implement the same kernel (x_i^T x_j)².

Now, assume the following mapping, Φ(x) = [√2x_1 √2x_2 √2x_1x_2 x_1² x_2² 1], i.e., an R² → R⁵ mapping plus a bias term as the constant 6th dimension's value. Then the dot product in a feature space F is given as

   Φ^T(x_i)Φ(x_j) = 2x_i1x_j1 + 2x_i2x_j2 + 2x_i1x_i2x_j1x_j2 + x_i1²x_j1² + x_i2²x_j2² + 1
                  = 1 + 2(x_i^T x_j) + (x_i^T x_j)² = (x_i^T x_j + 1)² = K(x_i, x_j), or

   K(x_i, x_j) = (x_i^T x_j + 1)² = Φ^T(x_i)Φ(x_j).

Thus, this last mapping leads to the complete second order polynomial kernel.

Many candidate functions can be applied as a convolution of an inner product (i.e., as kernel functions) K(x, x_i) in an SV machine. Each of these functions constructs a different nonlinear decision hypersurface in an input space. In its first three rows, the table below shows the three most popular kernels in SVMs in use today, and the inverse multiquadrics one as an interesting and powerful kernel yet to prove itself.

Table 2. Popular Admissible Kernels

   Kernel function                                      Type of classifier
   K(x, x_i) = (x^T x_i)                                Linear, dot product kernel; CPD
   K(x, x_i) = [(x^T x_i) + 1]^d                        Complete polynomial of degree d; PD
   K(x, x_i) = exp(-[(x - x_i)^T Σ^{-1} (x - x_i)]/2)   Gaussian RBF; PD
   K(x, x_i) = tanh[(x^T x_i) + b]*                     Multilayer perceptron; CPD
   K(x, x_i) = 1/√(||x - x_i||² + β)                    Inverse multiquadric function; PD

   * only for certain values of b; (C)PD = (conditionally) positive definite
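The kernels of Table 2 are one-liners in code. The sketch below (ours) writes them as plain functions; σ, d, b and β stand for the user-chosen parameters, and the Gaussian is written with the common isotropic simplification Σ = σ²I.

    import numpy as np

    linear = lambda x, xi: x @ xi                                     # CPD
    poly   = lambda x, xi, d=2: (x @ xi + 1.0)**d                     # PD
    rbf    = lambda x, xi, sigma=1.0: np.exp(-np.sum((x - xi)**2)
                                             / (2.0 * sigma**2))      # PD
    mlp    = lambda x, xi, b=-1.0: np.tanh(x @ xi + b)                # CPD, some b only
    inv_mq = lambda x, xi, beta=1.0: 1.0 / np.sqrt(np.sum((x - xi)**2) + beta)  # PD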

The positive definite (PD) kernels are the kernels whose Gram matrix G (a.k.a. the Grammian), calculated by using all the training data points, is positive definite (meaning that all its eigenvalues are strictly positive, i.e., λ_i > 0, i = 1, ..., l):

   G = K(x_i, x_j) = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_l)
                       k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_l)
                          ...          ...       ...     ...
                       k(x_l, x_1)  k(x_l, x_2)  ...  k(x_l, x_l) ]        (37)

The kernel matrix G is symmetric. Even more, any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.

Finally, we arrive at the point of presenting the learning in nonlinear classifiers (in which we are ultimately interested here). The learning algorithm for a nonlinear SV machine (classifier) follows from the design of an optimal separating hyperplane in a feature space. This is the same procedure as the construction of the hard and the soft margin classifiers in an x-space previously. In a Φ(x)-space, the dual Lagrangian given previously is now

   L_d(α) = Σ_{i=1}^{l} α_i - (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j Φ^T(x_i)Φ(x_j),        (38)

and, according to (36), by using chosen kernels we should maximize the following dual Lagrangian

   L_d(α) = Σ_{i=1}^{l} α_i - (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)        (39)

subject to

   α_i ≥ 0, i = 1, ..., l   and   Σ_{i=1}^{l} α_i y_i = 0.        (39a)

In a more general case, because of noise or due to generic class features, there will be an overlapping of training data points. Then nothing but the constraints on α_i changes: the nonlinear 'soft' margin classifier will be the solution of the quadratic optimization problem given by (39) subject to the constraints

   C ≥ α_i ≥ 0, i = 1, ..., l   and   Σ_{i=1}^{l} α_i y_i = 0.        (39b)

Again, the only difference from the separable nonlinear classifier is the upper bound C on the Lagrange multipliers α_i. In this way, we limit the influence of the training data points that will remain on the 'wrong' side of the separating nonlinear hypersurface.
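A quick numerical check (ours) of the PD property: build the Gram matrix (37) for the complete second order polynomial kernel on four 2-D points and inspect its eigenvalues.

    import numpy as np

    def gram_matrix(X, kernel):
        """Gram matrix of (37): G_ij = k(x_i, x_j)."""
        return np.array([[kernel(xi, xj) for xj in X] for xi in X])

    X = np.array([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
    G = gram_matrix(X, lambda a, b: (a @ b + 1.0)**2)
    print(np.linalg.eigvalsh(G))        # all strictly positive -> G is PD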

After the dual variables are calculated, the decision hypersurface d(x) is determined by

   d(x) = Σ_{i=1}^{l} y_i α_i K(x, x_i) + b = Σ_{i=1}^{l} v_i K(x, x_i) + b,        (40)

and the indicator function is

   i_F(x) = sign[d(x)] = sign[ Σ_{i=1}^{l} v_i K(x, x_i) + b ].

Note that the summation is not actually performed over all training data, but rather over the support vectors, because only for them do the Lagrange multipliers differ from zero. The existence and calculation of a bias b is now not a direct procedure as it is for a linear hyperplane. Depending upon the applied kernel, the bias b can be an implicit part of the kernel function. If, for example, a Gaussian RBF is chosen as the kernel, it can use a bias term as the (f + 1)st feature in F-space with constant output = +1, but this is not necessary. In short, PD kernels do not necessarily need an explicit bias term b, but b can be used. (More on this can be found in (Kecman, Huang, and Vogt, 2004) as well as in (Vogt and Kecman, 2004).) Same as for the linear SVM, (39) can be written in matrix notation as

   maximize     L_d(α) = -0.5 α^T H α + f^T α,        (41a)
   subject to   y^T α = 0,        (41b)
                C ≥ α_i ≥ 0, i = 1, ..., l,        (41c)

where α = [α_1, α_2, ..., α_l]^T, H denotes the Hessian matrix of this problem (H_ij = y_i y_j K(x_i, x_j)), and f is an (l, 1) unit vector, f = 1 = [1 1 ... 1]^T. Note that if K(x_i, x_j) is a positive definite matrix, then so is the matrix y_i y_j K(x_i, x_j).

The following 1-D example (chosen just for the sake of graphical presentation) will show the creation of a linear decision function in a feature space and the corresponding nonlinear (quadratic) decision function in an input space. Suppose we have four 1-D data points given as x_1 = 1, x_2 = 2, x_3 = 5, x_4 = 6, with the data at 1, 2, and 6 as class 1 and the data point at 5 as class 2, i.e., y_1 = -1, y_2 = -1, y_3 = 1, y_4 = -1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)². C is set to 50, which is of lesser importance here because the box constraints will not become active in this example: the maximal value of the dual variables α turns out to be smaller than C = 50.

Case 1: Working with a bias term b as given in (40). We first find α_i (i = 1, ..., 4) by solving the dual problem (41), which has the Hessian matrix

   H = [   4     9   -36    49
           9    25  -121   169
         -36  -121   676  -961
          49   169  -961  1369 ]
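Solving (41) for this Hessian can be sketched in a few lines (ours; a general-purpose SciPy solver stands in for a dedicated QP routine). The α values it returns are the ones quoted in the text below.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1., 2., 5., 6.])
    y = np.array([-1., -1., 1., -1.])
    C = 50.0
    H = np.outer(y, y) * (np.outer(x, x) + 1.0)**2   # H_ij = y_i y_j (x_i x_j + 1)^2

    # maximizing -0.5 a'Ha + 1'a  ==  minimizing 0.5 a'Ha - 1'a
    res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),
                   np.zeros(4),
                   jac=lambda a: H @ a - 1.0,
                   bounds=[(0.0, C)] * 4,
                   constraints={'type': 'eq', 'fun': lambda a: a @ y},  # (41b)
                   method='SLSQP')
    print(res.x.round(6))    # approx. [0, 2.5, 7.333333, 4.833333]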

The resulting alphas are α_1 = 0, α_2 = 2.499999, α_3 = 7.333333, α_4 = 4.833333, and the bias b is found from the requirement that the values of the decision function at the support vectors must equal the given y_i (i.e., from the KKT complementarity conditions). The model (decision function) is given by

   d(x) = Σ_{i=1}^{4} y_i α_i K(x_i, x) + b = Σ_{i=1}^{4} v_i (x_i x + 1)² + b, or by

   d(x) = 2.499999(-1)(2x + 1)² + 7.333333(1)(5x + 1)² + 4.833333(-1)(6x + 1)² + b
        = -0.666667x² + 5.333333x + b.

The bias b is determined from the requirement that at the SV points 2, 5 and 6 the outputs must be -1, 1 and -1 respectively. Hence, b = -9, resulting in the decision function

   d(x) = -0.666667x² + 5.333333x - 9.

The nonlinear (quadratic) decision function and the indicator one are shown in Fig 14. Note that in the calculations above six decimal places have been used for the alpha values. The calculation is numerically very sensitive, and working with fewer decimals can give very approximate or plainly wrong results.

The complete polynomial kernel as used in Case 1 is positive definite, so there is no need to use an explicit bias term b as presented above. Thus, one can use the same second order polynomial model without the bias term b. Note that in this case there is no equality constraint equation, which originates from setting the derivative of the primal Lagrangian with respect to the bias term b to zero. Hence, we do not use (41b) when using a positive definite kernel without a bias, as will be shown below in Case 2.

Figure 14  NL SV classification of 1-D inputs using a quadratic polynomial kernel. The nonlinear decision function (solid) and the indicator function (dashed) are shown. By using the complete second order polynomial kernel, the models with and without a bias term b are the same.
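A numerical double check (ours) of Case 1: with the exact alphas (2.5, 22/3, 29/6), forcing d(5) = +1 yields b = -9, and the outputs at the support vectors 2, 5 and 6 come out as -1, 1 and -1.

    import numpy as np

    x = np.array([1., 2., 5., 6.])
    y = np.array([-1., -1., 1., -1.])
    alpha = np.array([0.0, 2.5, 22.0/3.0, 29.0/6.0])     # exact Case 1 alphas

    d0 = lambda t: np.sum(alpha * y * (x * t + 1.0)**2)  # d(x) without bias
    b = 1.0 - d0(5.0)                                    # enforce d(5) = +1
    print(b)                                             # -9.0
    print([d0(t) + b for t in (2.0, 5.0, 6.0)])          # [-1.0, 1.0, -1.0]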

Case 2: Working without a bias term b. Because we use the same second order polynomial kernel, the Hessian matrix H is the same as in Case 1. The solution without the equality constraint for the alphas is α_1 = 0, α_2 = 24.999999, α_3 = 43.333333, α_4 = 27.333333. The model (decision function) is given by

   d(x) = Σ_{i=1}^{4} y_i α_i K(x_i, x) = Σ_{i=1}^{4} v_i (x_i x + 1)², or by

   d(x) = 24.999999(-1)(2x + 1)² + 43.333333(1)(5x + 1)² + 27.333333(-1)(6x + 1)²
        = -0.666667x² + 5.333333x - 9.

Thus, the nonlinear (quadratic) decision functions, and consequently the indicator functions, are equal in the two particular cases.

XOR Example: In the next example, shown in Figs 15 and 16, we present all the important mathematical objects of a nonlinear SV classifier by using the classic XOR (exclusive-or) problem. The graphs show all the mathematical functions (objects) involved in a nonlinear classification, namely, the nonlinear decision function d(x), the NL indicator function i_F(x), the training data (x_i), the support vectors (x_SV) and the separation boundaries. The same objects will be created in cases where the input vector x is of a dimensionality n > 2, but visualization is then not possible. In such cases one talks about the decision hyperfunction (hypersurface) d(x), the indicator hyperfunction (hypersurface) i_F(x), the training data (x_i), the support vectors (x_SV) and the separation hyperboundaries (hypersurfaces).

Note the different character of d(x), i_F(x) and the separation boundaries in the two graphs given below. However, in both graphs all the data are correctly classified. The analytic solution for Fig 16, for the second order polynomial kernel (i.e., for (x_i^T x_j + 1)² = Φ^T(x_i)Φ(x_j), where Φ(x) = [√2x_1 √2x_2 √2x_1x_2 x_1² x_2² 1], no explicit bias and C = ∞), goes as follows. The inputs and desired outputs are

   X = [ 1  1          y = d = [  1
         1  0                    -1
         0  1                    -1
         0  0 ],                  1 ].

The dual Lagrangian (39) has the Hessian matrix

   H = [  9  -4  -4   1
         -4   4   1  -1
         -4   1   4  -1
          1  -1  -1   1 ]
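Two lines (ours) reproduce this Hessian from the data, which is a useful sanity check when setting the problem up:

    import numpy as np

    X = np.array([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
    d = np.array([1., -1., -1., 1.])
    H = np.outer(d, d) * (X @ X.T + 1.0)**2    # H_ij = d_i d_j (x_i' x_j + 1)^2
    print(H.astype(int))                       # matches the matrix above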

Figure 15  XOR problem. The nonlinear decision and indicator function of a NL SVM over the input plane (x_1, x_2) and the separation boundaries are shown; the kernel functions (2-D Gaussians) are not shown. All four data are chosen as support vectors.

Figure 16  XOR problem. The kernel function is a 2-D polynomial. The nonlinear decision function, the nonlinear indicator function and the (hyperbolic) separation boundaries over the input plane are shown. All four data are support vectors.
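Since all four points are support vectors and no box constraint is active, the no-bias KKT conditions reduce to the linear system Hα = 1. A short check (ours) confirms the figure captions: all α_i come out strictly positive, and every training point sits exactly on the margin.

    import numpy as np

    X = np.array([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
    d = np.array([1., -1., -1., 1.])
    H = np.outer(d, d) * (X @ X.T + 1.0)**2

    alpha = np.linalg.solve(H, np.ones(4))      # active margins: H alpha = 1
    print(alpha)                                # [2. 2.6667 2.6667 4.3333] > 0

    dec = lambda t: np.sum(alpha * d * (X @ t + 1.0)**2)   # decision fn (40), b = 0
    print([round(dec(xi), 6) for xi in X])      # [1.0, -1.0, -1.0, 1.0] = d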
