Diplomarbeit. Support Vector Machines in der digitalen Mustererkennung


Fachbereich Informatik

Diplomarbeit

Support Vector Machines in der digitalen Mustererkennung

Ausgeführt bei der Firma Siemens VDO in Regensburg

vorgelegt von: Christian Mikos, St.-Wolfgangstrasse 9305 Regensburg

Betreuer: Herr Reinhard Rösl
Erstprüfer: Prof. Jürgen Sauer
Zweitprüfer: Prof. Dr. Herbert Kopp

Abgabedatum:

Acknowledgements

This work was written as my diploma thesis in computer science at the University of Applied Sciences Regensburg, Germany, under the supervision of Prof. Dr. Jürgen Sauer. The research was carried out at Siemens VDO in Regensburg, Germany. In Reinhard Rösl I found a very competent advisor there, whom I owe much for his assistance in all aspects of my work. Thank you very much! For the help during the writing of this document I want to thank all colleagues at the department at Siemens VDO. I enjoyed the work there very much in every sense and learned a lot which will surely be useful in the upcoming years. My special thanks go to Prof. Jürgen Sauer, who helped me out with any questions arising during this work.

CONTENTS

ABSTRACT
NOTATIONS
0 INTRODUCTION

I AN INTRODUCTION TO THE LEARNING THEORY AND BASICS

1 SUPERVISED LEARNING THEORY
1.1 Modelling the Problem
2 LEARNING TERMINOLOGY
2.1 Risk Minimization
2.2 Structural Risk Minimization (SRM)
2.3 The VC Dimension
2.4 The VC Dimension of Support Vector Machines, Error Estimation and Generalization Ability
3 PATTERN RECOGNITION
3.1 Feature Extraction
3.2 Classification
4 OPTIMIZATION THEORY
4.1 The Problem
4.2 Lagrangian Theory
4.3 Duality
4.4 Kuhn-Tucker Theory

II SUPPORT VECTOR MACHINES

5 LINEAR CLASSIFICATION
5.1 Linear Classifiers on Linearly Separable Data
5.2 The Optimal Separating Hyperplane for Linearly Separable Data
5.2.1 Support Vectors
5.2.2 Classification of Unseen Data
5.3 The Optimal Separating Hyperplane for Linearly Non-Separable Data
5.3.1 1-Norm Soft Margin - or the Box Constraint
5.3.2 2-Norm Soft Margin - or Weighting the Diagonal
5.4 The Duality of Linear Machines
5.5 Vector/Matrix Representation of the Optimization Problem and Summary
5.5.1 Vector/Matrix Representation
5.5.2 Summary
6 NONLINEAR CLASSIFIERS
6.1 Explicit Mappings
6.2 Implicit Mappings and the Kernel Trick
6.3 Requirements for Kernels - Mercer's Condition
6.4 Making Kernels from Kernels
6.5 Some Well-Known Kernels
6.6 Summary
7 MODEL SELECTION
7.1 The RBF Kernel
7.2 Cross Validation
8 MULTICLASS CLASSIFICATION
8.1 One-Versus-Rest (OVR)
8.2 One-Versus-One (OVO)
8.3 Other Methods

III IMPLEMENTATION

9 IMPLEMENTATION TECHNIQUES
9.1 General Techniques
9.2 Sequential Minimal Optimization (SMO)
9.2.1 Solving for Two Lagrange Multipliers
9.2.2 Heuristics for Choosing Which Lagrange Multipliers to Optimize
9.2.3 Updating the Threshold b and the Error Cache
9.2.4 Speeding up SMO
9.2.5 The Improved SMO Algorithm by Keerthi
9.2.6 SMO and the 2-Norm Case
9.3 Data Pre-Processing
9.3.1 Categorical Features
9.3.2 Scaling
9.4 Matlab Implementation and Examples
9.4.1 Linear Kernel
9.4.2 Polynomial Kernel
9.4.3 Gaussian Kernel (RBF)
9.4.4 The Impact of the Penalty Parameter C on the Resulting Classifier and the Margin

IV MANUALS, AVAILABLE TOOLBOXES AND SUMMARY

10 MANUAL
10.1 Matlab Implementation
10.2 Matlab Examples
10.3 The C++ Implementation for the Neural Network Tool
11 AVAILABLE TOOLBOXES IMPLEMENTING SVM
12 OVERALL SUMMARY

LIST OF FIGURES
LIST OF TABLES
LITERATURE
STATEMENT

APPENDIX
A SVM - APPLICATION EXAMPLES
A.1 Hand-Written Digit Recognition
A.2 Text Categorization
B LINEAR CLASSIFIERS
B.1 The Perceptron
C CALCULATION EXAMPLES
D SMO PSEUDO CODES
D.1 Pseudo Code of Original SMO
D.2 Pseudo Code of Keerthi's Improved SMO

ABSTRACT

The Support Vector Machine (SVM) is a new and very promising classification technique developed by Vapnik and his group at AT&T Bell Laboratories. This new learning algorithm can be seen as an alternative training technique for Polynomial, Radial Basis Function and Multi-Layer Perceptron classifiers. Recently it has shown very good results in the pattern recognition field of research, such as hand-written character, digit or face recognition, and it has also proved reliable in text categorization. It is mathematically very well founded and nowadays of great and growing interest in many new fields of research such as Bioinformatics.

Die Support Vector Machine (SVM) ist eine neue und sehr vielversprechende Klassifizierungs-Methode, entwickelt von Vapnik und seiner Gruppe in den AT&T Bell Forschungseinrichtungen. Dieser neue Ansatz im Bereich des computergestützten Lernens kann als alternative Trainingstechnik für Polynom-, Gaußkern- und Multi-Layer-Perzeptron-Klassifizierer aufgefasst werden. In jüngster Zeit zeigte diese neue Technik sehr gute Ergebnisse im Bereich der Mustererkennung, wie z.B. der Erkennung von handschriftlichen Buchstaben und Zahlen oder Gesichtszügen. Des Weiteren wurde sie auch zuverlässig im Bereich der Textkategorisierung eingesetzt. Die Technik ist mathematisch sehr gut fundiert und von immer weiter wachsendem Interesse in neueren Forschungsgebieten, wie der Bioinformatik.

NOTATIONS

x_i        input vector (input during training, already labeled)
z          input vector (input after training, to be classified)
y_i        output: class of input x_i (z)
X          input space
x^T        vector x transposed
⟨a·b⟩      inner product between vectors a and b (dot product)
sgn(f(x))  the signum function: +1 if f(x) ≥ 0 and -1 else
ℓ          training set size
S          training set {(x_i, y_i) : i = 1 ... ℓ}
(w, b)     defines the hyperplane H = {x : ⟨w·x⟩ + b = 0}
α_i        Lagrange multipliers
ξ_i        slack variables (for linearly non-separable datasets)
γ_i        margin of a single point x_i
L, L_P, L_D  Lagrangian: primal and dual
C          error weight
K(a, b)    kernel function calculated with vectors a and b
K          kernel matrix, (K)_ij = k(x_i, x_j)
SVM        Support Vector Machine
SV         support vectors
nSV        number of support vectors
rbf        radial basis functions
LM         learning machine
ERM        Empirical Risk Minimization
SRM        Structural Risk Minimization

Chapter 0

Introduction

In this work the rather new concept in learning theory, the Support Vector Machine, will be discussed in detail. The goal of this work is to give an insight into the methods used and to describe them in a way that a person without a deeply founded mathematical background can understand them, so that the gap between theory and practice can be closed. It is not the intention of this work to look into every aspect and algorithm available in the field of this learning theory, but to understand how and why it works at all and why it is of such rising interest at the time. This work should lay the basics for understanding the mathematical background, for being able to implement the technique, and for doing further research on whether this technique is suitable for the wanted purpose at all. As a product of this work the Support Vector Machine will be implemented both in Matlab and C++. The C++ part will be a module integrated into the so-called Neural Network Tool already used in the department at Siemens VDO, which already implements the Polynomial and Radial-Basis-Function classifiers. This tool is for testing purposes, to test suitable techniques for the later integration into the lane recognition system for cars currently under development there.

Support Vector Machines for classification are a rather new concept in learning theory. Their origins reach back to the early 60's (VAPNIK and LERNER 1963; VAPNIK and CHERVONENKIS 1964), but they stirred up attention only in 1995 with Vladimir Vapnik's book The Nature of Statistical Learning Theory [Vap95]. In the last few years Support Vector Machines have proved excellent performance in many real-world applications such as hand-written character recognition, image classification or text categorization. But because many aspects of this theory are still under intensive research, the number of introductory works of literature is very limited. The two books by Vladimir Vapnik (The Nature of Statistical Learning Theory [Vap95] and Statistical Learning Theory [Vap98]) present only a general high-level introduction to this field. The first tutorial purely on Support Vector Machines was written by C. Burges in 1998 [Bur98]. In the year 2000 CRISTIANINI and SHAWE-TAYLOR published An Introduction to Support Vector Machines [Nel00], which was the main source for this work.

All these books and papers give a good overview of the theory behind Support Vector Machines, but they do not give a straightforward introduction to application. This is where this work comes in. This work is divided into four parts: Part I gives an introduction into the supervised learning theory and the ideas behind pattern recognition. Pattern recognition is the environment in which the Support Vector Machine will be used in this work. The next chapter then lays the mathematical basics for the arising optimization problem. Part II introduces the Support Vector Machine itself with its mathematical background. For a better understanding, the case of classification will be restricted to the two-class problem first; later one can see that this is no limitation, because it can then easily be extended to the multiclass case. Here also the long studied kernel technique, which gives the Support Vector Machines their superior power, will be analysed in detail. Part III then analyses the implementation techniques for Support Vector Machines. It will be shown that there are many approaches for solving the arising optimization problem, but only the most used and best performing algorithms for large amounts of data will be discussed in detail. Part IV in the end is intended as a manual for the implementation done in Matlab and C++. There will also be given a list of widely used toolboxes for Support Vector Machines, both in C/C++ and Matlab. Last but not least, in the appendix some real-world applications, some calculation examples on the arising mathematical problems, the rather simple Perceptron algorithm for classification and the pseudo code used for the implementation will be stated.

Part I

An Introduction to the Learning Theory and Basics

Chapter 1

Supervised Learning Theory

When computers are applied to solve a practical problem, it is usually the case that the method of deriving the required output from a set of inputs can be described explicitly. But there arise many cases where one wants the machine to perform tasks that cannot be described by an algorithm. Such tasks cannot be solved by classical programming techniques, since no mathematical model exists for them or the computation of the exact solution is very expensive (it could last for hundreds of years, even on the fastest processors). As examples consider the problem of performing hand-written digit recognition (a classical problem of machine learning) or the detection of faces in a picture. There is need for a different approach to solve such problems. Maybe the machine is teachable, as children are in school? Meaning they are not given abstract definitions and theories by the teacher, but he points out examples of the input-output functionality. Consider children learning the alphabet. The teacher does not give them precise definitions of each letter, but he shows them examples. Thereby the children learn general properties of the letters by examining these examples. In the end these children will be able to read words in script style, even if they were taught only on typed ones. In other, more mathematical words, this observation leads to the concept of classifiers. The purpose of learning such a classifier from a few given examples, already correctly classified by the supervisor, is to be able to classify future unknown observations correctly. But how can learning from examples, which is called supervised learning, be formalized mathematically so that it can be applied to a machine?

1.1 Modelling the Problem

Learning from examples can be described in a general model by the following elements: the generator of the input data, the supervisor who assigns labels/classes to the data for learning, and the learning machine that returns some answer, hopefully close to the one of the supervisor. The labeled/preclassified examples (x_i, y_i) are referred to as the training data. The input/output pairings typically reflect a functional relationship mapping the inputs to outputs, though this is not always the case, for example when the outputs are corrupted by noise. But when an underlying function exists, it is referred to as the target function. So the goal is the estimation of this target function, which is learnt by the learning machine and is known as the solution of the learning problem. In the case of classification problems, e.g. "this is a man" and "this is a woman", this function is also known as the decision function. The optimal solution is chosen from a set of candidate functions which map from the input space to the output domain. Usually a set or class of candidate functions is chosen, known as hypotheses. As an example consider so-called decision trees, which are hypotheses created by constructing a binary tree with simple decision functions at the internal nodes and output values at the leaves (the y-values). A learning problem with binary outputs (0/1, yes/no, positive/negative, ...) is referred to as a binary classification problem, one with a finite number of categories as a multi-class classification one, while for real-valued outputs the problem is known as regression. In this diploma thesis only the first two categories will be considered, although the later discussed Support Vector Machines can easily be extended to the regression case. A more mathematical interpretation of this will be given now. The generator above determines the environment in which the supervisor and the learning machine act. It generates the vectors x ∈ R^n independently and identically distributed according to some unknown probability distribution P(x). The supervisor assigns the true output values y according to a conditional distribution function P(y|x) (the output is dependent on the input). This assumption includes the case y = f(x), in which the supervisor associates a fixed y with every x. The learning machine then is defined by a set of possible mappings f(x, α), where α is an element of a parameter space. An example of a learning machine according to binary classification is defined by oriented hyperplanes {x : ⟨w·x⟩ + b = 0}, where (w, b) ∈ R^(n+1) determines the

position of the hyperplanes in R^n. As a result the following learning machine (LM) is obtained:

LM = { f(x, (w, b)) = sgn(⟨w·x⟩ + b) ; (w, b) ∈ R^(n+1) }

The functions f : R^n → {-1, +1}, mapping the input to the positive (+1) or negative (-1) class, are called the decision functions. So this learning machine works as follows: the input x is assigned to the positive class if f(x) ≥ 0, and otherwise to the negative class.

Figure 1.1: Multiple possible decision functions in R^2. They are defined as ⟨w·x⟩ + b = 0 (for details see Part II of this work). Points to the left are assigned to the positive (+1) class, and the ones to the right to the negative (-1) class.

The above definition of a learning machine is called a Linear Learning Machine¹ because of the linear nature of the function f used here. Among all possible functions, the linear ones are the best understood and simplest to apply. They will provide the framework within which the construction of more complex systems is possible, as will be done in later chapters. There is need for a choice of the parameter α based on observations (the training set):

¹ This method of a learning machine will be described in detail in Part II, because Support Vector Machines implement this technique.

S = {(x_1, y_1), ..., (x_ℓ, y_ℓ) ; x_i ∈ R^n, y_i ∈ {-1, +1}}

This is called the training of the machine. The training set S is drawn according to the distribution P(x, y). If all this data is given to the learner (the machine) at the start of the learning phase, this is called batch learning. But if the learner receives only one example at a time and gives an estimation of the output before receiving the correct value, it is called online learning. In this work only batch learning is considered. Also, each of these two learning methods can be subdivided into unsupervised learning and supervised learning. Once a function for appropriately mapping the input to the output is chosen (learned), one wants to see how well it works on unseen data. Usually the training set is split into two parts: the labeled training set above and a so-called labeled test set. This test set is applied after training, knowing the expected output values, and the results of the classification of the machine are compared with the expected ones to gain the error rate of the machine; a minimal sketch of this protocol is given below. But simply verifying the quality of an algorithm in such a way is not enough. It is not only the goal of a gained hypothesis to be consistent with the training set but also to work well on future data. But there are also other problems inside the whole process of generating a verifiably consistent hypothesis. First, the function to be learned may not have a simple representation and hence may not be easily verified in this way. Second, the training data could frequently be noisy, and so there is no guarantee that there is an underlying function which correctly maps the training data. But the main problem arising in practice is the choice of the features. Features are the components the input vector consists of. Surely they have to describe the input data for classification in an appropriate way. Appropriate means, for example, no or little redundancy. Some hints on choosing a suitable representation for the data will be given in the upcoming chapters, but not in detail, because this would exceed the frame of this work. As an example of the second problem consider the classification of web pages into categories, which can never be an exact science. But such data is of increasing interest for learning. So there is a need for measuring the quality of a classifier in some other way: good generalization. The ability of a hypothesis/classifier to correctly classify data not only in the training set, or in other words, to make precise predictions by learning from few examples, is known as its generalization ability, and this is the property which has to be optimized.
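As an illustration of this train/test protocol, the following Matlab sketch (my own toy example, not part of the thesis implementation; the 70/30 split and the stand-in decision function are arbitrary assumptions) estimates an error rate on held-out data:

    % Toy data: two point clouds in R^2 with labels +1/-1
    X = [randn(50,2) + 1; randn(50,2) - 1];
    y = [ones(50,1); -ones(50,1)];
    l = size(X,1);
    idx = randperm(l);                 % random split of the labeled data
    nTrain = round(0.7*l);             % e.g. 70% for training, 30% for testing
    te = idx(nTrain+1:end);            % indices of the test set
    % stand-in for a trained decision function (a fixed hyperplane):
    f = @(Z) sign(sum(Z,2));
    errorRate = mean(f(X(te,:)) ~= y(te))   % fraction of misclassified test points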

Chapter 2

Learning Terminology

This chapter is intended to stress the main concepts arising from the theory of statistical learning [Vap79] and the VC theory [Vap95]. These concepts are the fundamentals of learning machines. Here terms such as generalization ability and capacity will be described.

2.1 Risk Minimization

As seen in the last chapter, the task of a learning machine is to infer general features from a set of labeled examples, the training set. The goal is to generalize from the training examples to the whole range of possible observations. The success of this is measured by the ability to correctly classify new unseen data not belonging to the training set. This is called the generalization ability. But as in the training a set of possible hypotheses arises, there is need for some measure to choose the optimal one, which is the same as later measuring the generalization ability. Mathematically this can be expressed using the risk function, a measure of quality, using the expected classification error for a trained machine. This expected risk (the test error) is the possible average error committed by the chosen hypothesis f(x, α) on an unknown example drawn randomly from the sample distribution P(x, y):

R(α) = ∫ ½ |y − f(x, α)| dP(x, y)    (2.1)

Here the function ½ |y − f(x, α)| is called the loss (the difference between the expected output (by the supervisor) and the response of the learning machine). R(α) is referred to as the risk function or simply the risk. The goal is to find parameters α* such that f(x, α*) minimizes the risk over the class of functions f(x, α).

But since P(x, y) is unknown, the value of the risk for a given parameter α cannot be computed directly. The only available information is contained in the given training set S. So the empirical risk R_emp(α) is defined to be just the measured mean error rate on the training set of finite length ℓ:

R_emp(α) = (1/ℓ) Σ_{i=1}^ℓ ½ |y_i − f(x_i, α)|    (2.2)

Note that here no probability distribution appears, and R_emp(α) is a fixed number for a particular choice of α and for a training set S. For further considerations, assume binary classification with outputs y ∈ {-1, +1}. Then the loss function can only produce the outputs 0 or 1. Now choose some η such that 0 ≤ η ≤ 1. Then for losses taking these values, with probability 1 − η, the following bound holds [Vap95]:

R(α) ≤ R_emp(α) + √( (h (log(2ℓ/h) + 1) − log(η/4)) / ℓ )    (2.3)

where h is a non-negative integer called the Vapnik Chervonenkis (VC) dimension. It is a measure of the notion of capacity. The second summand of the right-hand side is called the VC confidence. Capacity is the ability of a machine to learn any training set without error. It is a measure of the richness or flexibility of the function class. A machine with too much capacity tends to overfitting, whereas low capacity leads to errors on the training set. The most popular concept to describe the richness of a function class in machine learning is the Vapnik Chervonenkis (VC) dimension. Burges gives an illustrative example on capacity in his paper [Bur98]: "A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before. A machine with too little capacity is like the botanist's lazy brother, who declares that if it is green, it is a tree. Neither can generalize well."
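Since the right-hand side of (2.3) is computable once h is known, it can be evaluated directly; a small Matlab sketch (the concrete values of ℓ, h and η are invented for illustration):

    % VC confidence term of bound (2.3)
    l = 1000; h = 10; eta = 0.05;    % training set size, VC dimension, confidence
    vcConf = sqrt((h*(log(2*l/h) + 1) - log(eta/4)) / l)
    % bound (2.3): R <= Remp + vcConf, holding with probability 1-eta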

To conclude this subchapter, three key points about the bound of (2.3) can be drawn: First, it is independent of the distribution P(x, y). It only assumes that the training and test data are drawn independently according to some distribution P(x, y). Second, it is usually not possible to compute the left-hand side. Third, if h is known, it is easily possible to compute the right-hand side. The bound also shows that low risk depends both on the chosen class of functions (the learning machine) and on the particular function chosen by the learning algorithm, the hypothesis, which should be optimal. The bound decreases if a good separation on the training set is achieved by a learning machine with low VC dimension. This approach leads to the principles of structural risk minimization (SRM).

2.2 Structural Risk Minimization (SRM)

Let the entire class of functions K = {f(x, α)} be divided into nested subsets of functions such that K_1 ⊂ K_2 ⊂ ... ⊂ K_n ⊂ ... . For each subset it must be possible to compute the VC dimension h, or to get a bound on h itself. Then SRM consists of finding that subset of functions which minimizes the bound on the risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal. So overall the approach works as follows: The confidence interval is kept fixed (by choosing particular h's) and the empirical risk is minimized. In the neural network case this technique is adapted by first choosing an appropriate architecture and then eliminating classification errors. The second approach is to keep the empirical risk fixed (e.g. equal to zero) and to minimize the confidence interval. Support Vector Machines will also implement the principles of SRM, by finding the one canonical hyperplane among all which minimizes the norm ‖w‖ in the definition of a hyperplane by ⟨w·x⟩ + b = 0. SVMs, hyperplanes, canonical hyperplanes and why minimizing the norm works will be explained in Part II of this work in detail; here only a reference is given to this.

2.3 The VC Dimension

The VC dimension is a property of a set of functions {f(α)} and can be defined for various classes of functions f. But again, here only the functions corresponding to the two-class pattern case with y ∈ {-1, +1} are considered.

Definition 2.1 (Shattering) If a given set of ℓ points can be labeled in all possible ways, and for each labeling a member of the set {f(α)} can be found which correctly assigns those labels (classifies), this set of points is said to be shattered by that set of functions.

Definition 2.2 (VC Dimension) The Vapnik Chervonenkis (VC) dimension of a set of functions is defined as the maximum number of training points that can be shattered by it. The VC dimension is infinite if ℓ points can be shattered by the set of functions, no matter how large ℓ is.

Note that if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered. As an example consider shattering points with oriented hyperplanes in R^n. To give an introduction, assume the data lives in the R^2 space and the set of functions {f(α)} consists of oriented straight lines, so that for a given line all points on one side are assigned to the class +1, and all points on the other one to the class -1. The orientation in the following figures is indicated by an arrow, specifying the side where the points of class +1 are lying. While it is possible to find three points that can be shattered (figure 2.1) by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in R^2 is three. Without proof (it can be found in [Bur98]) it can be stated that the VC dimension of the set of oriented hyperplanes in R^n is n+1.

19 Fgure.: Three ponts not ng n a ne can be shattered b orented hperpanes n the R. The arrow ponts n the drecton of the postve eampes (back). Whereas four ponts can be found n the R, whch cannot be shattered b orented hperpanes..4 The VC Dmenson of Support Vector Machnes, Error Estmaton and Generazaton Abt It shoud be sad frst that ths subchapter does not cam competeness n an sense. There w be no proofs on the concusons stated and the contents wrtten are on ecerpts of the theor. Ths s because the theor stated here s beond the ntenton of ths work. The nterested reader can refer to the books about the Statstca Learnng Theor [Vap79], VC Theor [Vap95] and man other works on ths. Here on some mportant subsets for Support Vector Machnes of the whoe theor w be shown. Imagne ponts n the R, whch shoud be bnar cassfed: cass + or cass -. The are consstent wth man cassfers (hpothesses, set of 8

functions). But how can one minimize the room of the hypothesis set? One approach is to apply a margin to each data point (figure 2.2); then, the broader that margin, the smaller the room for hypotheses is. This approach is justified by Vapnik's learning theory.

Figure 2.2: Reducing the room for hypotheses by applying a margin to each point

The main conclusion of this technique is that a wide margin often leads to a good generalization ability but can restrict the flexibility in some cases. Therefore the later introduced maximal margin approach for Support Vector Machines is a practicable way. And this technique means that Support Vector Machines implement the principles of Structural Risk Minimization. The actual risk of Support Vector Machines was alternatively bounded by [Vap95]. The term Support Vectors here will be explained in Part II of this work; the bound is only stated here, but it is really general, because often one can see that the bound behaves in the other direction: few Support Vectors, but a high bound:

E[P(error)] ≤ E[number_of_support_vectors] / (number_of_training_samples)    (2.4)

where P(error) is the risk for a machine trained on ℓ−1 examples, E[P(error)] is the expectation over all choices of training sets of size ℓ−1, and E[number_of_support_vectors] is the expectation of the number of support vectors over all choices of training sets of size ℓ. To end this subchapter, some known VC dimensions of the later introduced Support Vector Machines should be stated, but without proof: Support Vector Machines implementing Gaussian kernels³ have infinite VC dimension, and the ones using polynomial kernels of degree p have a VC dimension of C(p+d, p) + 1,⁴ where d is the dimension where the data lives, e.g. R^d. So here the VC dimension is finite but grows rapidly with the degree. Against the bound of (2.3) this result is a disappointing one, because of the infinite VC dimension when using Gaussian kernels, the bound therefore becoming useless. But because of new developments in generalization theory, the usage of even infinite VC dimensions becomes practicable. The main theory is about Maximal Margin Bounds and gives another bound on the risk, which is even applicable in the infinite case. The theory works with a new analysis method in contrast to the VC dimension: the fat-shattering dimension. To look into the future: the generalization performance of Support Vector Machines is excellent in contrast to other long studied methods, e.g. classification based on the Bayesian theory. But as this is beyond this work, only a reference will be given here: the paper "Generalization Performance of Support Vector Machines and Other Pattern Classifiers" by Bartlett and Shawe-Taylor (1999). Now the theoretical groundwork for looking into Support Vector Machines, and why they work at all, has been laid.

³ See chapter 6
⁴ C(n, k) = n! / (k! (n−k)!), called the binomial coefficient
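The binomial-coefficient formula above can be evaluated directly; a one-line Matlab sketch (the values of d and p are invented, e.g. small grey-scale images and a degree-4 polynomial):

    % VC dimension of an SVM with a polynomial kernel of degree p in R^d
    d = 256; p = 4;
    vcDim = nchoosek(d + p, p) + 1   % grows rapidly with the degree p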

Chapter 3

Pattern Recognition

Figure 3.1: Computer vision: image processing and pattern recognition. The whole problem is split into subproblems to handle. Pattern recognition is arranged into the computer vision part.

Computer vision tries to teach the human ability of noticing and understanding the environment to a machine. The main problem thereby arising is the representation of the three-dimensional environment by two-dimensional sensors.

Definition 3.1 (Pattern recognition) Pattern recognition is the theory of the best possible assignment of an unknown pattern or observation to a meaning-class (classification). In other words: the process of identification of objects, with the help of already learned examples.

So the purpose of a pattern recognition program is to analyze a scene (mostly in the real world, with the aid of an input device such as a camera, for digitization) and to arrive at a description of the scene which is useful for the accomplishment of some task, e.g. face detection or hand-written digit recognition.

3.1 Feature Extraction

This part comprises the procedures for measuring the relevant shape information contained in a pattern, so that the task of classifying the pattern is made easy by a formal procedure. For example, in character recognition a typical feature might be the height-to-width ratio of a letter. Such a feature will be useful for differentiating between a W and an I, but for distinguishing between an E and an F this feature would be quite useless. So more features are necessary, or the one given above has to be replaced by another. The goal of feature extraction is to find as few features as possible that adequately differentiate the patterns in a particular application into their corresponding pattern classes. Because the more features there are, the more complicated the task of classification can become, since the degree of freedom (the dimension of the vectors) grows, and for each new feature introduced you usually need some hundreds of new training points to get reliable statements on their derivation. To give a link to the Support Vector Machines here: feature extraction is the main problem in practice, because of the proper selection you have to define (avoid redundancy) and because of the amount of test data you have to create for training for each new feature introduced.

3.2 Classification

The step of classification is concerned with making decisions concerning the class membership of a pattern in question. The task is to design a

decision rule that is easy to compute and that will minimize the probability of misclassification. To get a classifier, the method decided on to fulfil this step has to be trained by already classified examples, to get the optimal decision rule, because when dealing with classes of high complexity the classifier will not be describable as a linear one.

Figure 3.2: Development steps of a classifier.

As an example consider the distinction between apples and pears. Here the a-priori knowledge is that pears are higher than broad and apples are broader than high. So one feature would be the height-width-ratio. Another feature that could be chosen is the weight. So the picture of figure 3.3 will be gained after measurement of some examples. As can be seen, the classifier can nearly be approximated by a linear one (the horizontal line). Other problems could consist of more than only two classes, the classes could overlap, and therefore there is need of some error-tolerating scheme and the usage of nonlinearity. There are two ways of training a classifier:

Supervised learning
Unsupervised learning

The technique of supervised learning uses a representative sample, meaning it describes the classes very well. The sample leads to a classification which should approximate the real classes in feature space. There the separation boundaries are computed.

In contrast to this, unsupervised learning uses algorithms which analyze the grouping tendency of the feature vectors into point clouds (clustering). Simple algorithms are e.g. the minimum distance classification, the maximum likelihood classifier or classifiers based on the Bayesian theory; a small sketch of the first one follows below.

Figure 3.3: Training a classifier on the two-class problem of distinguishing between apples and pears by the usage of two features (weight and height-to-width ratio).
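The minimum distance classification mentioned above can be sketched in a few Matlab lines (the fruit data, the class means and all concrete numbers are purely invented for illustration):

    % Minimum distance classifier: assign x to the class with the nearest mean.
    apples = [0.90 + 0.05*randn(20,1), 150 + 20*randn(20,1)];  % [h/w ratio, weight g]
    pears  = [1.40 + 0.05*randn(20,1), 180 + 20*randn(20,1)];
    mA = mean(apples); mP = mean(pears);
    x = [1.30, 170];                        % unseen fruit to classify
    if norm(x - mA) < norm(x - mP)
        class = 'apple';
    else
        class = 'pear';
    end

Note that the raw weight dominates the Euclidean distance here; in practice the features would be scaled first, a point taken up again in section 9.3.2.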

Chapter 4

Optimization Theory

As we have seen in the first two chapters, the learning task may be formulated as an optimization problem. The searched hypothesis function should therefore be chosen in a way such that the risk function is minimized. Typically this optimization problem will be subject to some constraints. Later we will see that in the support vector theory we are only concerned with the case in which the function to be minimized/maximized, called the cost function, is a convex quadratic function, while the constraints are all linear. The known methods for solving such problems are called convex quadratic programming. In this chapter we will take a closer look at the Lagrangian theory, which is the most adapted way to solve such optimization problems with many variables. Furthermore the concept of duality will be introduced, which plays a major role in the concept of Support Vector Machines. The Lagrangian theory was first introduced in 1797, and initially it was only able to deal with functions constrained by equalities. Later, in 1951, this theory was extended by Kuhn and Tucker to be adapted to the case of inequality constraints. Nowadays this extension is known as the Kuhn-Tucker theory.

4.1 THE PROBLEM

The general optimization problem can be written as a minimization problem, since reversing the sign of the function to be optimized turns it into the equal maximization problem.

Definition 4.1 (Primal Optimization Problem) Given functions f, g_i and h_j defined on a domain Ω ⊆ R^n, the problem can be formulated:

Minimize f(x), x ∈ Ω
subject to g_i(x) ≤ 0, i = 1, ..., k
h_j(x) = 0, j = 1, ..., m

where f(x) is called the objective function, the g_i the inequality and the h_j the equality constraints. The optimal value of the function f is known as the value of the optimization problem. An optimization problem is called a linear program if the objective function and all constraints are linear, and a quadratic program if the objective function is quadratic, while the constraints remain linear.

Definition 4.2 (Standard Linear Optimization Problem)

Minimize c^T x
subject to Ax = b, x ≥ 0

or reformulated this means:

Minimize c_1 x_1 + ... + c_n x_n
subject to a_11 x_1 + ... + a_1n x_n = b_1
...
a_n1 x_1 + ... + a_nn x_n = b_n
x_i ≥ 0

and another representation is:

Minimize Σ_i c_i x_i
subject to Σ_j a_ij x_j = b_i, 1 ≤ i ≤ n, x_j ≥ 0

It is possible to rewrite each common linear optimization problem in this standard form, even if the constraints are given as inequalities. For further readings on this topic one can refer to the many good textbooks about optimization available. There are many ways to get the solution(s) of linear problems, e.g. Gaussian reduction, the Simplex method with the Hessian matrix, ..., but we will not encounter such problems and therefore do not discuss these techniques here.
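In Matlab such a standard-form linear program can be handed directly to linprog from the Optimization Toolbox; a tiny sketch (the numbers form an invented toy instance):

    % minimize c'x subject to Aeq*x = beq, x >= 0
    c = [1; 2]; Aeq = [1 1]; beq = 1; lb = zeros(2,1);
    x = linprog(c, [], [], Aeq, beq, lb, [])   % optimum: x = [1; 0]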

Definition 4.3 (Standard Quadratic Optimization Problem)

Minimize c^T x + ½ x^T D x
subject to Ax = b, x ≥ 0

with the matrix D overall positive (semi-)definite, so the objective function is convex. Semi-definite means that for each x, x^T D x ≥ 0 (in other words, D has non-negative eigenvalues). Non-convex functions and domains are not discussed here, because they will not play any role in the algorithms for Support Vector Machines. For further readings on nonlinear optimization, refer to [Jah96]. So in this problem you have variables in the forms x_i and x_i·x_j, which does not lead to a linear system, where only the form x_i is found.

Definition 4.4 (Convex Domains) A subdomain D of R^n is convex if for any two points x_1, x_2 ∈ D the connecting line between them is also an element of D. Mathematically this means: for all h ∈ [0,1] and x_1, x_2 ∈ D:

(1−h) x_1 + h x_2 ∈ D

For example, R^2 is a convex domain. In figure 4.1 only the three upper domains are convex.

Figure 4.1: Convex and non-convex domains
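Whether a given matrix D is positive semi-definite, and the objective of definition 4.3 hence convex, can be checked numerically via its eigenvalues; a sketch (the example matrix and the tolerance are my choices):

    D = [2 -1; -1 2];                         % example matrix
    isPSD = all(eig((D + D')/2) >= -1e-10)    % symmetrize, tolerate rounding error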

Definition 4.5 (Convex Functions) A function f in R^n is said to be convex in D if the domain D is convex and for all x_1, x_2 ∈ D and h ∈ [0,1] this applies:

f(h x_1 + (1−h) x_2) ≤ h f(x_1) + (1−h) f(x_2)

In words this means that the graph of the function always lies under the secant (or chord).

Figure 4.2: Convex and concave functions

Another criterion for the convexity of twice differentiable functions is the positive semi-definiteness of the Hessian matrix [Jah96]. The problem of minimizing a convex function on a convex domain (set) is known as a convex programming problem. The main advantage of such problems is the fact that every local solution to the convex problem is also a global solution and that a global solution is always unique there. In nonlinear, non-convex problems, the main problem is the local minima. For example, algorithms implementing the Gradient-Descent (-Ascent) method to find a minimum (maximum) of the objective function cannot guarantee that the found minimum is a global one, and so the solution would not be optimal. In the rest of this diploma thesis and in the support vector theory, the optimization problem can be restricted to the case of a convex quadratic function with linear constraints on the domain Ω ⊆ R^n.

Figure 4.4: A local minimum of a nonlinear and non-convex function

4.2 LAGRANGIAN THEORY

The intention of the Lagrangian theory is to characterize the solution of an optimization problem, initially when there are no inequality constraints. Later the method was extended to the presence of inequality constraints, known as the Kuhn-Tucker theory. To ease the understanding we first introduce the simplest case of optimization in the absence of any constraints.

Theorem 4.6 (Fermat) A necessary condition for w* to be a minimum of f(w), f ∈ C¹, is that the first derivative vanishes: ∂f(w*)/∂w = 0. This condition is also sufficient if f is a convex function.

Addition: A point x_0 = (x_1, ..., x_n) ∈ R^n realizing this condition is called a stationary point of the function f: R^n → R.

To use this on constrained problems, a function known as the Lagrangian is defined that unites information about both the objective function and its constraints. Then the stationarity of this function can be used to find solutions. In appendix C you can find a graphical solution to such a problem in two variables and the calculated Lagrangian solution to the same problem. Also an example for the general case is formulated there.

Definition 4.7 (Lagrangian) Given an optimization problem with objective function f(w) and the equality constraints h_1(w) = c_1, ..., h_n(w) = c_n, the Lagrangian function is defined as

L(w, α) = f(w) + Σ_{i=1}^n α_i (c_i − h_i(w))

And as every equality can be transformed to ĥ_i(w) = h_i(w) − c_i = 0, the Lagrangian is

L(w, α) = f(w) + Σ_{i=1}^n α_i ĥ_i(w)

The coefficients α_i are called the Lagrange multipliers.

Theorem 4.8 (Lagrange) A necessary condition for a point w* ∈ R^n to be a minimum (solution) of the objective function f(w) subject to h_i(w) = 0, i = 1, ..., n, with f, h_i ∈ C¹, is

∂L(w*, α*)/∂w = 0   (derivative with respect to w)
∂L(w*, α*)/∂α = 0   (derivative with respect to α)

These conditions are also sufficient in the case that L(w, α*) is a convex function. This means the solution is a global optimum.
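A minimal worked example of theorem 4.8 (my own, in the spirit of the calculation examples in appendix C): minimize f(w) = w_1² + w_2² subject to h(w) = w_1 + w_2 − 1 = 0. The Lagrangian is L(w, α) = w_1² + w_2² + α(1 − w_1 − w_2). Stationarity gives ∂L/∂w_1 = 2w_1 − α = 0 and ∂L/∂w_2 = 2w_2 − α = 0, hence w_1 = w_2 = α/2; the condition ∂L/∂α = 1 − w_1 − w_2 = 0 then yields α* = 1 and w* = (½, ½). Since L is convex in w, this stationary point is the global minimum.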

The conditions provide a linear system of n+m equations, with the last m being the equality constraints (see appendix C for examples). By solving this system one obtains the solution.

Note: At the optimal point the constraints equal zero, and so the value of the Lagrangian is equal to the objective function: L(w*, α*) = f(w*).

As an interpretation of the Lagrange multiplier α of the function f(w) + α(c − h(w)), we assume it as a function of c and differentiate it with respect to c:

∂L/∂c = α

But in the optimum L(w*, α*) = f(w*). So we can interpret that the Lagrange multiplier gives a hint on how the optimum is changing if the constant c of the constraint h(w) = c is changed.

Now to the most general case, where the optimization problem contains both equality and inequality constraints.

Definition 4.9 (Generalized Lagrangian Function) The general optimization problem can be stated as

Minimize f(w)
subject to g_i(w) ≤ 0, i = 1, ..., k (inequalities)
h_j(w) = 0, j = 1, ..., m (equalities)

Then the generalized Lagrangian is defined as:

L(w, α, β) = f(w) + Σ_{i=1}^k α_i g_i(w) + Σ_{j=1}^m β_j h_j(w) = f(w) + α^T g(w) + β^T h(w)

4.3 DUALITY

The introduction of dual variables is a powerful tool, because using this alternative - the dual - reformulation of an optimization problem often turns out to be easier to solve in contrast to its so-called primal problem, because the handling of the inequality constraints in the primal (which are often found) is very difficult. The dual problem to a primal problem is obtained by introducing the Lagrange multipliers, also called the dual variables. The dual function then does not depend on the primal variables anymore, and solving this problem is the same as solving the primal one. The new dual variables are then considered to be the fundamental unknowns of the problem. Duality is also a common procedure in linear optimization problems. For further readings look at [Mar00]. In general the primal minimization problem is then turned into the dual maximization one. So at the optimal solution point the primal and the dual function both meet, each having an extremum there (convex functions would only have one global extremum). Here we only look at the duality method important for Support Vector Machines. To transform the primal problem into its dual one, two steps are necessary. First, the derivatives of the set-up primal Lagrangian are set to zero with respect to the primal variables. Second, the relations so gained are substituted back into the Lagrangian. This removes the dependency on the primal variables and corresponds to explicitly computing the new function (for a proof of this see [Nel00])

θ(α, β) = inf_{w ∈ Ω} L(w, α, β)

So overall the primal minimization problem of definition 4.1 can be transformed into the dual problem as:

Definition 4.10 (Lagrangian Dual Problem)

Maximize θ(α, β) = inf_{w ∈ Ω} L(w, α, β)⁵
subject to α ≥ 0

⁵ inf = infimum. The infimum of any subset of a linear order (linearly ordered set) is the greatest lower bound of the subset. In particular, the infimum of any set of numbers is the largest number which is less than or equal to every number in the set. Rewritten this means: max_{α ≥ 0} θ(α, β) = max_{α ≥ 0} ( min_{w ∈ Ω} L(w, α, β) )

This strategy is a standard technique in the theory of Support Vector Machines. As seen later, the dual representation allows us to work in high dimensional spaces using so-called kernels without falling prey to the curse of dimensionality⁶. The Kuhn-Tucker complementarity conditions, introduced in the following subchapter, lead to a significant reduction of the data involved in the training process. These conditions imply that only the active constraints have non-zero dual variables and therefore are necessary to determine the searched hyperplane. This observation will later lead to the term support vectors, as seen in chapter 5.

4.4 KUHN-TUCKER THEORY

Theorem 4.11 (Kuhn-Tucker) Given an optimization problem with convex domain Ω ⊆ R^n,

Minimize f(w), w ∈ Ω
subject to g_i(w) ≤ 0, i = 1, ..., k
h_j(w) = 0, j = 1, ..., m

with f ∈ C¹ convex and g_i, h_j affine, necessary and sufficient conditions for a point w* to be an optimum are the existence of α*, β* such that

∂L(w*, α*, β*)/∂w = 0
∂L(w*, α*, β*)/∂β = 0
α_i* g_i(w*) = 0, i = 1, ..., k
g_i(w*) ≤ 0, i = 1, ..., k
α_i* ≥ 0, i = 1, ..., k

⁶ Explained in chapter 6

The third relation is also known as the KT complementarity condition. It implies that for active constraints α_i* ≥ 0, whereas for inactive ones α_i* = 0. As an interpretation of the complementarity condition one can say that a solution point can be in one of two positions with respect to an inequality constraint: either in the interior of the feasible region, with the constraint inactive, or on the boundary defined by that constraint, with the constraint active. So the KT conditions say that either a constraint is active, meaning g_i(w*) = 0 and α_i* ≥ 0, or the corresponding multiplier α_i* = 0. So the KT conditions give a hint on how the solution looks and how the Lagrange multipliers behave. And a point is an optimal solution if and only if these KT conditions are fulfilled. Summarizing this chapter, it can be said that all the theorems and definitions above give some useful techniques for solving convex optimization problems with inequality and equality constraints both acting at the same time. The goal of the techniques is to simplify the primally given problem by formulating the dual one, in which the constraints are mostly equalities, which are easier to handle. The KT conditions describe the optimal solution and its important behaviour and will be the stopping criterion for the later implemented numerical solutions. Later, in the chapters about the implementation of the solving algorithms for such optimization problems, we will see that the main problem will be the size of the training set, which defines the size of the kernel matrix used in the solution. With the use of standard techniques for calculating the solution, the kernel matrix will quickly exceed hundreds of megabytes in memory, even when the sample size is just a few thousand points (which is not much in real-world applications).

Part II

Support Vector Machines

Chapter 5

Linear Classification

5.1 Linear Classifiers on Linearly Separable Data

As a first step in understanding and constructing Support Vector Machines we study the case of linearly separable data, which is simply classified into two classes, the positive and the negative one, also known as binary classification. To give a link to an example important nowadays, imagine the classification problem of email into spam or not-spam. (A calculated example and examples on linearly (non-)separable data can be found in appendix B.) This is frequently performed by using a real-valued function f : X ⊆ R^n → R in the following way: the input x = (x_1, ..., x_n) is assigned to the positive class if f(x) ≥ 0, and otherwise to the negative one. The vector x is built up from the relevant features which are used for classification. In our spam example above we need to extract relevant features (certain words) from the text and build a feature vector for the corresponding document. Often such feature vectors consist of the counted numbers of predefined words, as in figure 5.1. If you would like to learn more about text classification / categorization, you can have a look at [Joa98], where the feature vectors have very high dimensions. In this diploma thesis we assume that the features are already available. We consider the case where f(x) is a linear function of x ∈ X, so it can be written as

f(x) = ⟨w·x⟩ + b = Σ_{i=1}^n w_i x_i + b    (5.1)

where (w, b) ∈ R^n × R are the parameters.

Figure 5.1: Vector representation of the sentence "Take Viagra before watching a video or leave Viagra be to play in our online casino."

These are often referred to as the weight vector w and the bias b, terms borrowed from the neural network literature. As stated in Part I, the goal is to learn these parameters from the given and already classified data (done by the supervisor/teacher), the training set. So this way of learning is called supervised learning. The decision function for the classification of an input x = (x_1, ..., x_n) is thus given by sgn(f(x)):

sgn(f(x)) = +1, if f(x) ≥ 0 (positive class); −1, else (negative class)

Geometrically we can interpret this behaviour as follows (see figure 5.2): One can see that the input space X is split into two parts by the so-called hyperplane defined by the equation ⟨w·x⟩ + b = 0. This means every input vector x solving this equation is directly part of the hyperplane. A hyperplane is an affine subspace⁷ of dimension n−1 which divides the space into two half spaces which correspond to the inputs of the two distinct classes.

⁷ A translation of a linear subspace of R^n is called an affine subspace. For example, any line or plane in R^3 is an affine subspace.
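As a Matlab sketch of equation (5.1) and this decision rule (the word counts follow the spirit of figure 5.1, while the weight vector and bias are invented stand-ins for learned values):

    % counts of predefined words in a document, e.g. [take viagra video online casino]
    x = [1; 2; 1; 1; 1];
    w = [0.1; 2.0; -0.5; 0.3; 1.0];   % assumed learned weights
    b = -1.5;                          % assumed learned bias
    f = w'*x + b;                      % equation (5.1)
    if f >= 0, c = +1; else c = -1; end  % sgn as defined in the text (f = 0 -> +1)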

In the example of figure 5.2, n = 2, a two-dimensional input space, so the hyperplane is simply a line here. The vector w defines a direction perpendicular to the hyperplane, so the direction of the plane is unique, while varying the value of b moves the hyperplane parallel to itself. Thereby negative values of b move the hyperplane, running through the origin, into the positive direction.

Figure 5.2: A separating hyperplane (w, b) for a two-dimensional training set. The smaller dotted lines represent the class of hyperplanes with the same w and different values of b.

In fact it is clear to see that if one wants to represent all possible hyperplanes in the space R^n, the representation is only possible by involving n + 1 free parameters, n given by w and one by b. But the question that arises here is which hyperplane to choose, because there are many possible ways in which it can separate the data. So we need a criterion for choosing the best one, the optimal separating hyperplane. The goal behind supervised learning from examples for classification can be restricted to the consideration of the two-class problem without loss of generality. In this problem the goal is to separate the two classes by a function which is induced from available examples. The overall goal is to produce a classifier (by finding parameters w and b) that will work well on unseen examples, i.e. that generalizes well.

So if the distance between the separating hyperplane and the training points becomes too small, even test examples near to the given training points would be misclassified. Figure 5.3 illustrates this behaviour. Therefore it seems that the classification of unseen data is much more successful in setting B than in setting A. This observation leads to the concept of the maximal margin hyperplane, or the optimal separating hyperplane.

Figure 5.3: Which separation to choose? Almost zero margin (A) or large margin (B)?

In appendix B.1 we have a closer look at an example with a simple iterative algorithm, separating points from two classes by means of a hyperplane, the so-called Perceptron. It is only applicable on linearly separable data. There we also find some important issues, also stressed in the following chapters, which will have a large impact on the algorithm(s) used in the Support Vector Machines.

5.2 The Optimal Separating Hyperplane for Linearly Separable Data

Definition 5.1 (Margin) Consider the separating hyperplane H defined by ⟨w·x⟩ + b = 0, with both w and b normalised by ŵ = w/‖w‖ and b̂ = b/‖w‖.

The (functional) margin γ_i(w,b) of an example (x_i, y_i) with respect to H is defined as the distance between x_i and H:

γ_i(w,b) = y_i (⟨ŵ·x_i⟩ + b̂)

The margin γ_S(w,b) of a set of vectors A = {x_1, ..., x_n} is defined as the minimum distance from H to the vectors in A:

γ_S(w,b) = min_{x_i ∈ A} γ_i(w,b)

For clarification see figures 5.4 and 5.5.

Figure 5.4: The (functional) margin of two points with respect to a hyperplane

In figure 5.5 we have introduced two new identifiers, d+ and d−: let them be the shortest distances from the separating hyperplane H to the closest positive (negative) example (the smallest functional margin from each class). Then the geometric margin is defined as d+ + d−.

Figure 5.5: The (geometric) margin of a training set

The training set is therefore said to be optimally separated by the hyperplane if it is separated without any error and the distance between the closest vectors to the hyperplane is maximal (maximal margin) [Vap98]. So the goal is to maximize the margin. As Vapnik showed in his work [Vap98], we can assume canonical hyperplanes in the upcoming discussion without loss of generality. This is necessary because the following problem exists: for any scaling parameter c ≠ 0, ⟨w·x⟩ + b = 0 and ⟨cw·x⟩ + cb = 0 describe the same hyperplane. E.g. 10·x_1 + 1·x_2 + 0 = 0 has the possible solution x = (1, −10). With a parameter c of value 5, we will get

5·10·x_1 + 5·1·x_2 + 5·0 = 0, i.e. 50·x_1 + 5·x_2 + 0 = 0, which can also be solved by x = (1, −10). So (cw, cb) describe the same hyperplane as (w, b) do. This means the hyperplane is not described uniquely! For uniqueness, (w, b) always need to be scaled by a factor c relative to the training set. The following constraint is chosen to do this:

min_i |⟨w·x_i⟩ + b| = 1

This constraint scales the hyperplane in a way such that the training points nearest to it get some important property: now they solve ⟨w·x_i⟩ + b = +1 for x_i of class +1 and, on the other side, ⟨w·x_i⟩ + b = −1 for x_i of class −1. A hyperplane scaled in such a way is called a canonical hyperplane. Reformulated this means (implying correct classification):

y_i (⟨w·x_i⟩ + b) ≥ 1, i = 1, ..., ℓ    (5.2)

This can be transformed into the following constraints:

⟨w·x_i⟩ + b ≥ +1 for y_i = +1
⟨w·x_i⟩ + b ≤ −1 for y_i = −1    (5.3)

Therefore it is clear to see that the hyperplanes H1 and H2 in figure 5.5 are solving ⟨w·x⟩ + b = +1 and ⟨w·x⟩ + b = −1. They are called margin hyperplanes. Note that H1 and H2 are parallel, they have the same normal w (as H does too), and that no other training points fall between them in the margin! They solve min_i |⟨w·x_i⟩ + b| = 1.

Definition 5.2 (Distance) The Euclidean distance d(w, b; x) of a point x belonging to a class y ∈ {−1, +1} from the hyperplane (w, b) defined by ⟨w·x⟩ + b = 0 is

d(w, b; x) = |⟨w·x⟩ + b| / ‖w‖    (5.4)

As stated above, training points (x_i, +1) and (x_j, −1) that are nearest to the hyperplane scaled in this way, i.e. respectively lying on H1 and H2, have the distances d+ and d− from it (see figure 5.5). Reformulated with equation (5.4) and constraints (5.3), this means:

⟨w·x_i⟩ + b = +1 and ⟨w·x_j⟩ + b = −1

d+ = |⟨w·x_i⟩ + b| / ‖w‖ = 1/‖w‖ and d− = |⟨w·x_j⟩ + b| / ‖w‖ = 1/‖w‖

So overall, as seen in figure 5.5, the geometric margin of a separating canonical hyperplane is d+ + d− = 2/‖w‖. As stated, the goal is to maximize this margin. That is achieved by minimising ‖w‖. The transformation to a quadratic function of the form Φ(w) = ½‖w‖² does not change the result but will ease the later calculation. This is because we now solve the problem with the help of the Lagrangian method. There are two reasons for doing so. First, the constraints of (5.2) will be replaced by constraints on the Lagrange multipliers themselves, which will

be much easier to handle (they are equalities then). Second, the training data will only appear in the form of dot products between vectors, which will be a crucial concept later in generalizing the method to the nonlinearly separable case and the use of kernels. And so the problem is reformulated into a convex one, which is overall easier to handle by the Lagrangian method with its differentiations. Summarizing, we have the following optimization problem to solve:

Given a linearly separable training set S = ((x_1, y_1), ..., (x_ℓ, y_ℓ)),

Minimize ½‖w‖²
subject to ⟨w·x_i⟩ + b ≥ +1 for y_i = +1
⟨w·x_i⟩ + b ≤ −1 for y_i = −1    (5.5)

The constraints are necessary to ensure the uniqueness of the hyperplane, as mentioned above!

Note: ‖w‖ = √⟨w·w⟩, because ⟨w·w⟩ = w_1·w_1 + ... + w_n·w_n = w_1² + ... + w_n², so minimizing ½‖w‖² is the same as minimizing ½⟨w·w⟩.

Also, the optimization problem is independent of the bias b, because provided equation (5.2) is satisfied, i.e. it is a separating hyperplane, changing the value of b only moves it in the normal direction to itself. Accordingly the margin remains unchanged, but the hyperplane would no longer be optimal. The problem of (5.5) is known as a convex quadratic optimization⁸ problem with linear constraints, and can be efficiently solved by using the method of the Lagrange multipliers and the duality theory (see chapter 4).

⁸ Convexity will be proved in chapter 5.3.

The primal Lagrangian for (5.5) and the given linearly separable training set S = ((x_1, y_1), ..., (x_ℓ, y_ℓ)) is

L_P(w, b, α) = ½⟨w·w⟩ − Σ_{i=1}^ℓ α_i [ y_i (⟨w·x_i⟩ + b) − 1 ]    (5.6)

where the α_i ≥ 0 are the Lagrange multipliers. This Lagrangian L_P has to be minimized with respect to the primal variables w and b. As seen in chapter 4, at the saddle point the two derivatives with respect to w and b must vanish (stationarity),

∂L_P(w, b, α)/∂w = w − Σ_{i=1}^ℓ α_i y_i x_i = 0
∂L_P(w, b, α)/∂b = Σ_{i=1}^ℓ α_i y_i = 0

obtaining the following relations:

w = Σ_{i=1}^ℓ α_i y_i x_i  and  Σ_{i=1}^ℓ α_i y_i = 0    (5.7)

By substituting the relations (5.7) back into L_P one arrives at the so-called Wolfe dual of the optimization problem (now only dependent on α, no more w and b!):

L_D(α) = Σ_{i=1}^ℓ α_i − ½ Σ_{i,j=1}^ℓ α_i α_j y_i y_j ⟨x_i·x_j⟩    (5.8)

So the dual problem for (5.6) can be formulated: Given a linearly separable training set S = ((x_1, y_1), ..., (x_ℓ, y_ℓ)),

Maximize W(α) = Σ_{i=1}^ℓ α_i − ½ Σ_{i,j=1}^ℓ α_i α_j y_i y_j ⟨x_i·x_j⟩
subject to Σ_{i=1}^ℓ y_i α_i = 0 ; α_i ≥ 0    (5.9)
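For small problems this dual is exactly the kind of convex QP that a standard solver handles; a minimal Matlab sketch using quadprog from the Optimization Toolbox (an illustration only, not the SMO solver discussed in Part III; the four toy points are invented):

    % Hard-margin dual (5.9): maximize sum(a) - 1/2 a'*H*a with H_ij = y_i y_j <x_i.x_j>;
    % quadprog minimizes, so the sign of the objective is flipped.
    X = [2 2; 1 1; -1 -1; -2 -2];  y = [1; 1; -1; -1];   % toy separable set
    l = size(X,1);
    H = (y*y') .* (X*X');             % label-weighted Gram matrix
    f = -ones(l,1);                   % linear term: -sum(alpha)
    Aeq = y'; beq = 0;                % constraint sum(y_i alpha_i) = 0
    lb = zeros(l,1);                  % alpha_i >= 0
    alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
    w = X' * (alpha .* y)             % optimal weight vector, see (5.10)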

Note: The matrix G_ij = ⟨x_i·x_j⟩ is known as the Gram matrix G.

So the goal is to find parameters α* which solve this optimization problem. As a solution to construct the optimal separating hyperplane with maximal margin we obtain the optimal weight vector:

w* = Σ_{i=1}^ℓ α_i* y_i x_i    (5.10)

Remark: One may think that up to now the problem can easily be solved like the one in appendix C, with the use of the Lagrangian theory and the primal (dual) objective function. This could be right when having input vectors of small dimension. But in the real-world case the number of variables will be over some thousand. Here, solving the system with standard techniques will not be practicable in terms of time and memory usage of the corresponding vectors and matrices. This issue will be discussed in the implementation chapter later.

5.2.1 Support Vectors

Stating the Kuhn-Tucker (KT) conditions for the primal problem L_P above (5.6), as seen in chapter 4, we get

∂L_P(w*, b*, α*)/∂w = w* − Σ_i α_i* y_i x_i = 0
∂L_P(w*, b*, α*)/∂b = Σ_i α_i* y_i = 0
y_i (⟨w*·x_i⟩ + b*) − 1 ≥ 0, i = 1, ..., ℓ
α_i* ≥ 0, i = 1, ..., ℓ
α_i* [ y_i (⟨w*·x_i⟩ + b*) − 1 ] = 0, i = 1, ..., ℓ    (5.11)

As mentioned, the optimization problem for SVMs is a convex one (a convex function, with constraints giving a convex feasible region). And for convex problems the KT conditions are necessary and sufficient for w*, b* and α* to be a solution. Thus, solving the primal/dual problem of the SVMs

is equivalent to finding a solution to the KT conditions (for the primal)⁹ (see chapter 4, too). The fifth relation in (5.11) is known as the KT complementarity condition. In the chapter on optimization theory an intuition was given on how it works. In the SVM's problem it has a good graphical meaning. It states that for a given training point x_i either the corresponding Lagrange multiplier α_i equals zero or, if it is not zero, x_i lies on one of the margin hyperplanes (see figure 5.4 and the following text) H1 or H2:

H1: ⟨w·x_i⟩ + b = +1
H2: ⟨w·x_i⟩ + b = −1

On them are the training points with minimal distance to the optimal separating hyperplane (with maximal margin). The vectors lying on H1 or H2, implying α_i > 0, are called Support Vectors (SV).

Definition 5.3 (Support Vectors) A training point x_i is called a support vector if its corresponding Lagrange multiplier α_i > 0.

All other training points, having α_i = 0, either lie on one of the two margin hyperplanes (equality in (5.2)) or on one side of H1 or H2 (inequality in (5.2)). A training point with α_i = 0 can be on one of the two margin hyperplanes, because the complementarity condition in (5.11) only states that all SVs are on the margin hyperplanes, but not that the SVs are the only ones on them. So there may be the case where both α_i = 0 and y_i (⟨w·x_i⟩ + b) − 1 = 0. Then the point lies on one of the two margin hyperplanes without being a SV. Therefore SVs are the only points involved in determining the optimal weight vector in equation (5.10). So the crucial concept here is that the optimal separating hyperplane is uniquely defined by the SVs of a training set. That means, repeating the training with all other points removed, or moved around without crossing H1 or H2, leads to the same weight vector and therefore to the same optimal separating hyperplane.

⁹ Only these will be needed, because the primal/dual problem is an equivalent one, so we will maximize the dual (it is only dependent on α!) and as a criterion take the KT conditions of the primal.

In other words, a compression has taken place. So for repeating the training later, the same result can be achieved by only using the determined SVs.

Figure 5.6: The optimal separating hyperplane (OSH) with maximal margin is determined by the support vectors (SV, marked) lying on the margin hyperplanes H1 and H2.

Note that in the dual representation the value of b does not appear, and so the optimal value b* has to be found making use of the primal constraints:

y_i (⟨w*·x_i⟩ + b) − 1 ≥ 0, i = 1, ..., ℓ

So only the optimal value of w is explicitly determined by the training procedure. This implies we have optimal values for α. Therefore it is possible to pick an α_i > 0, a support vector, and so, with the substitution of w* = Σ_j α_j y_j x_j into the above inequality, the upper constraint becomes an equality (because a support vector always is part of a margin hyperplane) and b can be computed. Numerically it is safer to compute b for all α_i > 0 and take the mean value; another approach, as in the book [Nel00], is:

b* = −½ ( max_{y_i = −1} (⟨w*·x_i⟩) + min_{y_i = +1} (⟨w*·x_i⟩) )    (5.12)

Note: This approach to compute the bias has been shown to be problematic with regard to the implementation of the SMO algorithm, as shown by [Ker01]. This issue will be discussed in the implementation chapter later.

5.2.2 Classification of unseen data

After the hyperplane's parameters (w* and b*) have been learned with the training set, we can classify unseen/unlabeled data points z. In the binary case (2 classes), discussed up to now, the found hyperplane divides R^n into two regions: one where ⟨w*·x⟩ + b* > 0 and the other one where ⟨w*·x⟩ + b* < 0. The idea behind the maximal margin classifier is to determine on which of the two sides the test pattern lies and to assign the corresponding label −1 or +1 (as all classifiers do), and also to maximize the margin between the two sets. Hence the used decision function can be expressed with the optimal parameters w* and b*, and therefore by the found/used support vectors, their corresponding α_i* > 0 and b*. So overall the decision function of the trained maximal margin classifier for some data point z can be formulated:

f(z, α*, b*) = sgn(⟨w*·z⟩ + b*) = sgn( Σ_{i=1}^ℓ y_i α_i* ⟨x_i·z⟩ + b* ) = sgn( Σ_{i ∈ SV} y_i α_i* ⟨x_i·z⟩ + b* )    (5.13)

whereby the last reformulation only sums over the elements (training point x_i, corresponding label y_i, associated α_i* and the bias b*) which are associated with a support vector (SV), because only they have α_i* > 0 and therefore an impact on the sum. All in all, the optimal separating hyperplane we get by solving the margin optimization problem is a very simple special case of a Support Vector Machine, because it computes directly on the input data. But it is a good starting point for understanding the forthcoming concepts.
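Continuing the quadprog sketch from above (still an illustration, not the thesis implementation), the bias and the decision function (5.13) need only the support vectors:

    sv = find(alpha > 1e-6);          % support vector indices (numerical threshold)
    % bias: complementarity gives y_i(<w.x_i> + b) = 1 on the SVs; averaging
    % over all SVs is the numerically safer variant mentioned in the text
    b = mean(y(sv) - X(sv,:)*w);
    Z = [3 3; -3 0];                  % unseen points, one per row
    pred = sign(Z*X(sv,:)' * (alpha(sv).*y(sv)) + b)   % decision function (5.13)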

In the next chapters the concept will be generalized to nonlinear classifiers, and therefore the concept of kernel mapping will be introduced. But first the adaption of the separating hyperplane to linearly non-separable data will be done.

5.3 The Optimal Separating Hyperplane for Linearly Non-Separable Data

The algorithm above for the maximal margin classifier cannot be used in many real-world applications. In general, noisy data will render linear separation impossible, but the biggest problem will still be the features used in practice, leading to overlapping classes. The main problem with the maximal margin classifier is the fact that it allows no classification errors during training. Either the training is perfect without any errors, or there is no solution at all. Hence it is intuitive that we need a way to relax the constraints of (5.3). But each violation of the constraints needs to be punished by a misclassification penalty, i.e. an increase in the primal objective function L_P. This can be realized by introducing the so-called positive slack variables ξ_i (i = 1, ..., ℓ) in the constraints first and, as shown later, introducing an error weight C, too:

⟨w·x_i⟩ + b ≥ +1 − ξ_i for y_i = +1
⟨w·x_i⟩ + b ≤ −1 + ξ_i for y_i = −1
ξ_i ≥ 0

As above, these two constraints can be rewritten into one:

y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ≥ 0, i = 1, ..., ℓ    (5.14)

So the ξ_i's can be interpreted as a value that measures how much a point fails to have a margin (distance to the OSH) of 1/‖w‖. So it indicates where a point lies compared to the separating hyperplane (see figure 5.7):

ξ_i > 1 (i.e. y_i(⟨w·x_i⟩ + b) < 0): misclassification of x_i
0 < ξ_i ≤ 1: x_i is classified correctly, but lies inside the margin

  ξ_i = 0: x_i is classified correctly and lies outside the margin or on the margin boundary

So a classification error is marked by the corresponding ξ_i exceeding unity; therefore Σ_i ξ_i is an upper bound on the number of training errors. Overall, with the introduction of these slack variables, the goal is to maximize the margin and simultaneously minimize misclassifications. To define a penalty on training errors, the error weight C is introduced via the term C Σ_i ξ_i. This parameter has to be chosen by the user. In practice, C is varied through a wide range of values and the optimal performance is assessed using a separate validation set, or using a technique called cross-validation for verifying performance with the training set alone.

Figure 5.7: Values of the slack variables: (1) misclassification of x_1, with ξ_1 larger than the margin (ξ_1 > 1); (2) correct classification of x_2 lying inside the margin, with 0 < ξ_2 ≤ 1; (3) correct classification of x_3 outside the margin (or on it), with ξ_3 = 0.

So the optimization problem can be extended to:

  Minimize  ½⟨w·w⟩ + C Σ_{i=1}^{l} ξ_i^k
  subject to  y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ≥ 0 ;  ξ_i ≥ 0 ;  i = 1, …, l  (5.15)

The problem is again a convex one for any positive integer k. This approach is called the Soft Margin generalization, while the original concept above is known as the Hard Margin, because it allows no errors. The Soft Margin case is widely used with the values k = 1 (1-Norm Soft Margin) and k = 2 (2-Norm Soft Margin).

5.3.1 1-Norm Soft Margin — or the Box Constraint

For k = 1 the primal Lagrangian, as above, can be formulated as

  L_P(w, b, ξ, α, β) = ½⟨w·w⟩ + C Σ_i ξ_i − Σ_i α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ] − Σ_i β_i ξ_i

with α_i ≥ 0 and β_i ≥ 0.

Note: As described in chapter 4, we need the additional multipliers β_i here because of the new inequality constraints ξ_i ≥ 0.

As before, the corresponding dual representation is found by differentiating L_P with respect to w, ξ and b:

  ∂L_P/∂w = w − Σ_i y_i α_i x_i = 0  ⇒  w = Σ_i y_i α_i x_i
  ∂L_P/∂ξ_i = C − α_i − β_i = 0
  ∂L_P/∂b = Σ_i y_i α_i = 0

By resubstituting these relations back into the primal we obtain the dual formulation L_D:

Given a training set S = ((x_1, y_1), …, (x_l, y_l)):

  Maximize  L_D(w, b, ξ, α, β) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i·x_j⟩
  subject to  Σ_i y_i α_i = 0 ;  0 ≤ α_i ≤ C  (5.16)

This problem is curiously identical to the maximal (hard) margin one in (5.9). The only difference is that C − α_i − β_i = 0 together with β_i ≥ 0 enforces α_i ≤ C. So in the soft margin case the Lagrange multipliers are upper bounded by C. The Kuhn-Tucker complementarity conditions for the primal above are:

  α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ] = 0 ;  i = 1, …, l
  ξ_i (α_i − C) = 0 ;  i = 1, …, l

A further consequence of the KT conditions is that non-zero slack variables ξ_i can only occur when β_i = 0 and therefore α_i = C. The corresponding point x_i then has a distance of less than 1/‖w‖ from the hyperplane and therefore lies inside the margin. This can be seen from the constraints (only shown for y_i = +1, the other case is analogous):

  ⟨w·x_i⟩ + b = 1, i.e. (⟨w·x_i⟩ + b)/‖w‖ = 1/‖w‖, for points on the margin hyperplane, while
  ⟨w·x_i⟩ + b = 1 − ξ_i, i.e. (⟨w·x_i⟩ + b)/‖w‖ = (1 − ξ_i)/‖w‖ < 1/‖w‖, for ξ_i > 0.

Therefore points with non-zero slack variables have a distance of less than 1/‖w‖. Points with 0 < α_i < C lie exactly at the target distance of 1/‖w‖ and therefore on one of the margin hyperplanes (ξ_i = 0). This also shows that the hard margin hyperplane is obtained from the soft margin case by setting C to infinity (C = ∞).

The fact that the Lagrange multipliers are upper bounded by the value of C gives this technique its name: box constraint, because the vector α is constrained to lie inside the box with side length C in the positive orthant (α_i ≥ 0). This approach is also known as the SVM with linear loss function.

5.3.2 2-Norm Soft Margin — or Weighting the Diagonal

This is the case k = 2. Before stating the primal Lagrangian, and to ease the upcoming calculation, note that for ξ_i < 0 the first constraint of (5.15) still holds if we set ξ_i = 0. Hence we still obtain the optimal solution when the positivity constraint on ξ_i is removed. This leads to the following primal Lagrangian:

  L_P(w, b, ξ, α) = ½⟨w·w⟩ + (C/2) Σ_i ξ_i² − Σ_i α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ]

with α_i ≥ 0 again the Lagrange multipliers. As before, the corresponding dual is found by differentiating with respect to w, ξ and b and imposing stationarity (i.e. setting to zero):

  ∂L_P/∂w = w − Σ_i y_i α_i x_i = 0  ⇒  w = Σ_i y_i α_i x_i
  ∂L_P/∂ξ_i = C ξ_i − α_i = 0
  ∂L_P/∂b = Σ_i y_i α_i = 0

and again resubstituting these relations back into the primal to obtain the dual formulation L_D:

  L_D(w, b, ξ, α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i·x_j⟩ − (1/2C) Σ_i α_i²

Using the equality Σ_i α_i² = Σ_{i,j} y_i y_j α_i α_j δ_ij, where δ_ij is the Kronecker delta (defined to be 1 if i = j and 0 otherwise): inserting y_i y_j for i = j changes nothing, because y_i is either +1 or −1 and y_i y_i = 1, so we merely multiply by an extra 1 — but this lets us merge the last term into the quadratic one and simplify L_D to get the final problem to be solved:

Given a training set S = ((x_1, y_1), …, (x_l, y_l)):

  Maximize  L_D(w, b, ξ, α) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ( ⟨x_i·x_j⟩ + (1/C) δ_ij )
  subject to  Σ_i y_i α_i = 0 ;  α_i ≥ 0  (5.17)

The complementarity KT conditions for the primal problem above are

  α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ] = 0 ;  i = 1, …, l

This whole problem can be solved with the same methods used for the maximal margin classifier. The only difference is the addition of 1/C to the diagonal of the Gram matrix G_ij = ⟨x_i·x_j⟩ — only to the diagonal, because of the Kronecker delta. This approach is also known as the SVM with quadratic loss function.

Summarizing this subchapter: the soft margin optimization is a compromise between little empirical risk and a maximal margin; for an example look at figure 5.8. The value of C can be interpreted as representing the trade-off between minimizing the training set error and maximizing the margin. All in all, by using C as an upper bound on the Lagrange multipliers, the role of outliers is reduced, since no point can obtain a too large Lagrange multiplier.


Figure 5.8: Decision boundaries arising when using a Gaussian kernel with a fixed value of σ in three different machines: (a) the maximal margin SVM, (b) the 1-norm soft margin SVM and (c) the 2-norm soft margin SVM. The data is an artificially created two-dimensional set; the blue dots are positive examples and the red ones negative examples.

5.4 The Duality of Linear Machines

This section is intended to stress a fact that has been used and remarked upon several times before: the linear machines introduced above can be formulated in a dual description. This reformulation will turn out to be crucial for the construction of the more powerful generalized Support Vector Machines below. But what does duality of classifiers mean? As seen in the former chapter, the normal vector w can be represented as a linear combination of the training points:

  w = Σ_{i=1}^{l} y_i α_i x_i

with S = ((x_1, y_1), …, (x_l, y_l)) the given training set, already classified by the supervisor. The α_i were introduced by the Lagrangian approach used to find a

solution to the margin maximization problem. They were called the dual variables of the problem and are therefore the fundamental unknowns. On the way to the solution we then obtained

  W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i·x_j⟩

and the reformulated decision function for unseen data z of (5.13):

  f(z, α, b) = sgn(⟨w·z⟩ + b) = sgn( Σ_i y_i α_i ⟨x_i·z⟩ + b )

The crucial observation here is that the training and test points never act through their individual attributes. The training points only appear as entries in the Gram matrix G_ij = ⟨x_i·x_j⟩ during the training phase; later, in the test phase, they only appear in an inner product with the test points z.

5.5 Vector/Matrix Representation of the Optimization Problem and Summary

5.5.1 Vector/Matrix Representation

To give a first impression of how the above problems can be solved using a computer, the problem(s) will be formulated in the equivalent notation with vectors and matrices. This notation is more practical and understandable, and it is used in many implementations. As described above, the convex quadratic optimization problem which arises for the hard (C = ∞), 1-norm (C < ∞) and 2-norm (add 1/C to the diagonal of the Gram matrix) margin cases is the following:

  Maximize  L_D(w, b, ξ, α, β) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i·x_j⟩
  subject to  Σ_i y_i α_i = 0 ;  0 ≤ α_i ≤ C

This problem can be expressed as:

  Maximize  eᵀα − ½ αᵀQα
  subject to  0 ≤ α_i ≤ C ;  yᵀα = 0  (5.18)

where e is the vector of all ones, C > 0 the upper bound, and Q the l × l positive semi-definite¹⁰ matrix with Q_ij = y_i y_j ⟨x_i·x_j⟩.

With a correct training set S = ((x_1, y_1), …, (x_l, y_l)) of length l, the objective of (5.18) written out looks like:

  (1 … 1)(α_1, …, α_l)ᵀ − ½ (α_1 … α_l) [ Q_11 … Q_1l ; ⋮ ⋱ ⋮ ; Q_l1 … Q_ll ] (α_1, …, α_l)ᵀ

¹⁰ Semi-definite: for each α, αᵀQα ≥ 0 (Q has non-negative eigenvalues). Also see the next page for an explanation.
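As an illustration of (5.18) — a sketch only, assuming SciPy is available, and not the MATLAB quadprog/qp routines discussed in the implementation part — the dual can be handed to any general-purpose constrained optimizer on a toy data set; for the 2-norm case one would add 1/C to the diagonal of Q instead of bounding α by C:

  import numpy as np
  from scipy.optimize import minimize

  X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.5], [-1.0, -1.0]])
  y = np.array([1.0, 1.0, -1.0, -1.0])
  C = 10.0

  Q = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j <x_i, x_j>
  e = np.ones(len(y))

  obj = lambda a: 0.5 * a @ Q @ a - e @ a      # minimize the negated dual
  res = minimize(obj, np.zeros(len(y)), method="SLSQP",
                 bounds=[(0.0, C)] * len(y),
                 constraints=[{"type": "eq", "fun": lambda a: y @ a}])
  alpha = res.x
  w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i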

5.5.2 Summary

As seen in chapter 4, quadratic problems with a so-called positive (semi-)definite matrix are convex functions. This allows the crucial concepts of solutions to convex functions to be adapted (see chapter 4: convexity, KT).

In the former chapters the convexity of the objective function was assumed without proof. So let M be any (possibly non-square) matrix and set A = MᵀM. Then A is a positive semi-definite matrix, since we can write

  xᵀAx = xᵀMᵀMx = (Mx)ᵀ(Mx) = ‖Mx‖² ≥ 0  (5.19)

for any vector x. If we take M to be the matrix whose columns are the vectors x_1, …, x_l, then A is the Gram matrix (⟨x_i·x_j⟩) of the set S = (x_1, …, x_l), showing that Gram matrices are always positive semi-definite. Therefore the above matrix Q is also positive semi-definite.

Summarized, the problem to be solved up to now can be stated as

  Maximize  L_D(w, b, ξ, α, β) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i·x_j⟩
  subject to  Σ_i y_i α_i = 0 ;  0 ≤ α_i ≤ C  (5.20)

with the particularly simple primal KT conditions as criteria for a solution to the 1-norm optimization problem:

  α_i = 0  ⇒  y_i (⟨w*·x_i⟩ + b*) ≥ 1
  0 < α_i < C  ⇒  y_i (⟨w*·x_i⟩ + b*) = 1
  α_i = C  ⇒  y_i (⟨w*·x_i⟩ + b*) ≤ 1  (5.21)

Notice that the slack variables ξ_i do not need to be computed for this case because, as seen in chapter 5.3.1, they will only be non-zero if α_i = C and β_i = 0. Recall the primal of that chapter,

  L_P(w, b, ξ, α, β) = ½⟨w·w⟩ + C Σ_i ξ_i − Σ_i α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ] − Σ_i β_i ξ_i

Setting α_i = C and β_i = 0, the third sum vanishes, and the second sum contributes a term −C Σ_i ξ_i which cancels against C Σ_i ξ_i — so the slack variables drop out of the Lagrangian altogether.

For the maximal margin case the conditions are:

  α_i = 0  ⇒  y_i (⟨w*·x_i⟩ + b*) ≥ 1
  α_i > 0  ⇒  y_i (⟨w*·x_i⟩ + b*) = 1  (5.22)

And last but not least, for the 2-norm case:

  α_i = 0  ⇒  y_i (⟨w*·x_i⟩ + b*) ≥ 1
  α_i > 0  ⇒  y_i (⟨w*·x_i⟩ + b*) = 1 − α_i/C  (5.23)

The last condition is obtained by implicitly defining ξ_i with the help of the primal KT condition ∂L_P/∂ξ_i = C ξ_i − α_i = 0 of chapter 5.3.2, hence ξ_i = α_i/C; together with the complementarity KT condition α_i [ y_i (⟨w·x_i⟩ + b) − 1 + ξ_i ] = 0, the third condition above is gained. As seen in the soft margin chapters, points for which the second equation holds are support vectors on one of the margin hyperplanes, while points for which the third one holds lie inside the margin and are therefore called margin errors.

These KT conditions will be used later and prove important when implementing algorithms for numerically solving the problem of (5.20): a point α is an optimum of (5.20) if and only if the KT conditions are fulfilled and Q_ij = y_i y_j ⟨x_i·x_j⟩ is positive semi-definite. The second requirement was proven above. After the training process (solving the quadratic optimization problem, yielding the vector α and from it the bias b), the classification of unseen data z is performed by

  f(z, α*, b*) = sgn(⟨w*·z⟩ + b*) = sgn( Σ_i y_i α_i* ⟨x_i·z⟩ + b* ) = sgn( Σ_{x_i ∈ SV} y_i α_i* ⟨x_i·z⟩ + b* )  (5.13)

where the x_i are the training points whose corresponding α_i are greater than zero (and upper bounded by C) and therefore support vectors. The question arising here is: why always classify new data by use of the α_i, and why not simply save the resulting weight vector w? Up to now it would indeed be possible to do that, removing the need to store the training points and their labels. Then again, as seen above, there will normally be very few support vectors, and only they, with their corresponding α_i and y_i, are necessary to reconstruct w. The main reason, however, will be given in chapter 6, where we will see that we must use the α_i and cannot simply store w.

To give a short link to the implementation issues discussed later: in most cases the 1-norm is used, because in real-world applications the data will normally not be noise-free and linearly separable, so the maximal margin approach will not lead to satisfactory results — although the main problem in practice remains the selection of the feature data used. The 2-norm is used in fewer cases, because it is not easy to integrate into the SMO algorithm, discussed in the implementation chapter.

Chapter 6

Nonlinear Classifiers

The last chapter showed how linear classifiers can easily be computed by means of standard optimization techniques. But linear learning machines are restricted because of their limited computational power, as highlighted in the 1960s by Minsky and Papert. Summarized, real-world applications require more expressive hypothesis spaces than linear functions. In other words, the target concept may be too complex to be expressed as a simple linear combination of the given attributes (which is what linear machines do) — equivalently, the decision function is not a linear function of the data. This problem can be overcome by the use of the so-called kernel technique. The general idea is to map the input data nonlinearly to a (nearly always) higher dimensional space and then separate it there by a linear classifier. This results in a nonlinear classifier in input space (see figure 6.1). Another solution to this problem has been proposed in neural network theory: multiple layers of thresholded linear functions, which led to the development of multi-layer neural networks.

Figure 6.1: A classification task made simpler by a feature map Φ: input space on the left, feature space on the right, where the data can be separated by a linear classifier, which leads to the nonlinear classifier in input space.

6.1 Explicit Mappings

Now the representation of the training examples will be changed by mapping the data to a (possibly infinite dimensional) Hilbert space¹ F. Usually the space F will have a much higher dimension than the input space X. The mapping Φ : X → F is applied to each labelled example before training, and then the optimal separating hyperplane is constructed in the space F:

  Φ : X = Rⁿ → F,  x = (x_1, …, x_n) ↦ Φ(x) = (Φ_1(x), …, Φ_N(x))  (6.1)

This is equivalent to mapping the whole input space X into F. The components of Φ(x) are called features, while the original quantities are sometimes referred to as the attributes; F is called the feature space.

The task of choosing the most suitable representation of the data is known as feature selection. This can be a very difficult task, and different approaches to it exist. Frequently one seeks to identify the smallest set of features that still conveys the essential information contained in the original attributes. This is known as dimensionality reduction,

  x = (x_1, …, x_n) ↦ Φ(x) = (Φ_1(x), …, Φ_d(x)),  d < n  (6.2)

and it can be very beneficial, as both computational and generalization performance can degrade as the number of features grows — a phenomenon known as the curse of dimensionality. The difficulty one faces with high dimensional feature spaces is a trade-off: the larger the set of (probably redundant) features is, the more likely it is that the function to be learned can be represented using a standard learning machine. Another approach to feature selection is the detection of irrelevant features and their elimination. As an example, consider the law of gravitation, which only uses information about the masses and the positions of two bodies; an irrelevant feature would be the colour or the temperature of the two bodies. As a last word on feature selection: it should be considered carefully as a part of the learning process, but it is also naturally a somewhat arbitrary step which needs some prior knowledge of the underlying target function. Therefore recent research has been done on techniques for feature reduction.

¹ A Hilbert space is a vector space with some additional restrictions. A space H is separable if there exists a countable subset D ⊆ H such that every element of H is the limit of a sequence of elements of D. A Hilbert space is a complete, separable inner product space. Finite dimensional vector spaces like Rⁿ are Hilbert spaces. This space will be described in more detail a little further on in this chapter; for further reading see [Ne00].

However, in the rest of this diploma thesis we will not discuss feature selection techniques, because, as Cristianini and Shawe-Taylor proved in their book [Ne00], we can afford to use infinite dimensional feature spaces and avoid the computational problems by means of the implicit mapping described in the next chapter. So the curse of dimensionality can be said to be irrelevant when implicitly mapping the data — an approach also known as the Kernel Trick.

Before illustrating the mapping with an example, first notice that the only way in which data appears in the training problem is in the form of dot products ⟨x_i·x_j⟩. Now suppose this data is first mapped to some other (possibly infinite dimensional) space F, using the mapping of (6.1): Φ : Rⁿ → F. Then of course, as seen in (6.1) and (6.2), the training algorithm would only depend on the data through dot products in F, i.e. on functions of the form ⟨Φ(x_i)·Φ(x_j)⟩ (all other variables are scalars). Second, there is in general no vector mapping to w via Φ, but we can write w in the form w = Σ_i y_i α_i Φ(x_i), and the whole hypothesis (decision) function will be of the type f(x) = sgn(⟨w·Φ(x)⟩ + b), or reformulated, f(x) = sgn( Σ_i y_i α_i ⟨Φ(x_i)·Φ(x)⟩ + b ). So a support vector machine is constructed which lives in the new, higher dimensional space F — but all the considerations of the former chapters still hold, since we are still doing a linear separation, just in a different space.

But now a simple example with an explicit mapping. Consider a given training set S = ((x_1, y_1), …, (x_l, y_l)) of points in R¹ with class labels +1 and −1: S = {(−1, +1), (0, −1), (+1, +1)}. Trivially these three points are not separable by a hyperplane — here a point² — in R¹ (see figure 6.2). So first the data is nonlinearly mapped to R³ by applying

² The input dimension is 1, therefore the hyperplane is of dimension 0, and a 0-dimensional object is a point.

  Φ : R¹ → R³,  x ↦ (x², √2·x, 1)

Figure 6.2: A non-separable example in the input space R¹. The hyperplane would be a single point, but no point can separate the data points.

This step results in a training set consisting of the vectors (1, −√2, 1), (0, 0, 1) and (1, √2, 1), with the corresponding labels (+1, −1, +1). As illustrated in figure 6.3, the solution in the new space R³ can easily be seen geometrically in the Φ1Φ2-plane (see figure 6.4). It is w = (1, 0, 0)ᵀ, which is already normalized, meaning it has a length of 1, and the bias becomes b = −0.5 (a negative b means moving the hyperplane running through the origin in the positive direction). So the learning task can easily be solved in R³ by linear separation. But what does the decision function look like in the original space R¹, where we need it?

Remember that w can be written in the form w = Σ_i y_i α_i Φ(x_i).

Figure 6.3: Creation of a separating hyperplane, i.e. a plane, in the new space R³.

Figure 6.4: Looking at the Φ1Φ2-plane, the solution for w and b can easily be read off by geometric interpretation of the picture.

In our particular example it can be written as w = Σ_{i=1}^{3} y_i α_i Φ(x_i) and worked out:

  w = ½ · (1, −√2, 1)ᵀ − 1 · (0, 0, 1)ᵀ + ½ · (1, √2, 1)ᵀ = (1, 0, 0)ᵀ

The solving vector α is then (½, 1, ½). With the equation

  ⟨Φ(x)·Φ(z)⟩ = x²z² + 2xz + 1 = (xz + 1)²  (6.3)

the hyperplane in R³ then becomes, expressed with the original training points x_i in R¹:

  ⟨w·Φ(z)⟩ + b = Σ_i y_i α_i ⟨Φ(x_i)·Φ(z)⟩ + b = ½(−z + 1)² − 1 + ½(z + 1)² − ½ = z² − ½ = 0

This leads to the nonlinear "hyperplane" in R¹ consisting of two points: z = −1/√2 and z = +1/√2. As seen in equation (6.3), the inner product in the feature space has an equivalent function in the input space. Now we introduce an abbreviation for the dot product in feature space:

  K(x, z) := ⟨Φ(x)·Φ(z)⟩  (6.4)

Clearly, if the feature space is very high-dimensional, or even infinite dimensional, the right-hand side of (6.4) will be very expensive to compute.

The observation in (6.3), together with the problem just described, motivates the search for ways to evaluate inner products in feature space without making direct use of the feature space or the mapping Φ. This approach leads to the terms Kernel and Kernel Trick.

6.2 Implicit Mappings and the Kernel Trick

Definition 6.1 (Kernel Function) Given a mapping Φ : X → F from input space X to an (inner product) feature space³ F, we call the function K : X × X → R a kernel function if for all x, z ∈ X

  K(x, z) = ⟨Φ(x)·Φ(z)⟩.  (6.5)

The kernel function thus behaves like an inner product in feature space, but can be evaluated as a function in input space. For example, take the polynomial kernel K(x, y) = ⟨x·y⟩^d. Now assume d = 2 and x, y ∈ R² (the original input space); then we get:

³ Inner product space: A vector space X is called an inner product space if there exists a bilinear map (linear in each argument) that for each two elements x, y ∈ X gives a real number, denoted ⟨x·y⟩, satisfying ⟨x·x⟩ ≥ 0 and ⟨x·x⟩ = 0 ⇔ x = 0. E.g. for x = (x_1, …, x_n), y = (y_1, …, y_n) ∈ X = Rⁿ and fixed positive numbers λ_1, …, λ_n, the following defines a valid inner product: ⟨x·y⟩ = Σ_{i=1}^{n} λ_i x_i y_i = xᵀAy, where A is the n × n diagonal matrix (non-zero only on the diagonal) with entries A_ii = λ_i.

  ⟨x·y⟩² = (x_1 y_1 + x_2 y_2)² = x_1²y_1² + 2 x_1 y_1 x_2 y_2 + x_2²y_2²
        = ⟨(x_1², √2 x_1 x_2, x_2²) · (y_1², √2 y_1 y_2, y_2²)⟩ = ⟨Φ(x)·Φ(y)⟩  (6.6)

So the data is mapped to R³. But the second step can be left out by implicitly calculating ⟨Φ(x)·Φ(y)⟩ with the vectors in input space, as (⟨x·y⟩)² — which is the same as in the above calculation, where the input vectors are first mapped to the feature space and the dot product is calculated there. So by implicitly mapping the input vectors to the feature space, we are able to calculate the dot product there without even knowing the underlying mapping Φ!

Summarized: by implicitly performing such a nonlinear mapping to a higher dimensional space, the computation succeeds without increasing the number of parameters, because the kernel function computes the inner product in feature space using only the two inputs in input space.
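The identity (6.6) can be checked numerically; the following minimal Python sketch compares the explicit route (map, then dot product in R³) with the implicit one (kernel evaluation in R²):

  import numpy as np

  # Degree-2 polynomial kernel K(x, y) = <x, y>^2 and its explicit map (6.6)
  def phi(x):
      return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

  x = np.array([1.0, 2.0])
  z = np.array([3.0, -1.0])

  explicit = np.dot(phi(x), phi(z))   # dot product in feature space R^3
  implicit = np.dot(x, z) ** 2        # kernel value computed in R^2
  assert np.isclose(explicit, implicit)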

To generalize: a polynomial kernel K(x, y) = ⟨x·y⟩^d with degree d and attributes in an input space of dimension n maps the data to a feature space of dimension C(n + d − 1, d)⁴. In the example of (6.6) this means, with n = 2 and d = 2:

  C(n + d − 1, d) = C(3, 2) = 3! / (2!·1!) = 3

and indeed, as seen above, the data is really mapped from R² to R³.

Figure 6.5 shows the whole procedure for classifying an unknown point z once the kernel-based SVM has been trained, i.e. once the optimal weight vector w (defined by the α_i, the corresponding training points x_i and their labels y_i) and the bias b are available.

Figure 6.5: The whole procedure for classification of a test vector z (in this example the test and training vectors are simple digits).

To stress the important facts: in contrast to the example in chapter 6.1, the chain of arguments is now inverted. There, we started by explicitly defining a mapping Φ before applying the learning algorithm. Now the starting point is choosing a kernel function K, which implicitly defines the mapping Φ, thereby avoiding the feature space both in the computation of inner products and in the whole design of the learning machine itself. As seen above, both the learning and the test step depend only on the values of inner products in feature space; hence, as shown, they can be formulated in terms of kernel functions. Once such a kernel function has been chosen, the decision function for unseen data z, (5.13), becomes:

⁴ C(n, k) = n! / (k!(n−k)!), called the binomial coefficient.

  f(z) = sgn(⟨w·Φ(z)⟩ + b) = sgn( Σ_i y_i α_i K(x_i, z) + b )  (6.7)

And, as said before, as a consequence we do not need to know the underlying feature map to be able to solve the learning task in feature space!

Remark: As remarked in chapter 5, a consequence of using kernels is that directly storing the resulting weight vector w is no longer practicable: as seen in (6.7), we would then have to know the mapping and could not use the advantage arising from the usage of kernels.

But which functions can be chosen as kernels?

6.2.1 Requirements for Kernels — Mercer's Condition

As a first requirement for a function to be chosen as a kernel, the definition (6.5) gives two conditions, because the mapping has to go into an inner product feature space. It can easily be seen that K has to be a symmetric function:

  K(x, z) = ⟨Φ(x)·Φ(z)⟩ = ⟨Φ(z)·Φ(x)⟩ = K(z, x)  (6.8)

Another condition for an inner product space is the Schwarz inequality:

  K(x, z)² = ⟨Φ(x)·Φ(z)⟩² ≤ ⟨Φ(x)·Φ(x)⟩ ⟨Φ(z)·Φ(z)⟩ = K(x, x) K(z, z)  (6.9)

However, these conditions are not sufficient to guarantee the existence of a feature space. Here Mercer's Theorem gives sufficient conditions (Vapnik 1995; Courant and Hilbert 1953). The following formulation of Mercer's Theorem is given without proof, as stated in the paper [Bur98].

Theorem 6.2 (Mercer's Theorem) There exist a mapping Φ and an expansion

  K(x, y) = ⟨Φ(x)·Φ(y)⟩ = Σ_i Φ_i(x) Φ_i(y)

if and only if, for any g(x) such that

  ∫ g(x)² dx < ∞ (is finite),  (6.10)

it holds that

  ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.  (6.11)

Note: (6.11) has to hold for every g satisfying (6.10). This theorem also covers the infinite dimensional case.

Another, simplified condition for K to be a kernel in the finite case can be seen from (6.8), (6.9) and from describing K through its eigenvectors and eigenvalues (the proof is given in [Ne00]).

Proposition 6.3 Let X be a finite input space with K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix K = (K(x_i, x_j))_{i,j} is positive semi-definite.

Mercer's Theorem can therefore be seen as an extension of this proposition, based on the study of integral operator theory.
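In practice, Proposition 6.3 can be checked numerically on a finite sample: a candidate kernel matrix must be symmetric and have no (significantly) negative eigenvalues. A small sketch, with a tolerance that absorbs floating-point noise:

  import numpy as np

  # Proposition 6.3 in practice: test a symmetric kernel matrix for
  # positive semi-definiteness via its eigenvalues.
  def is_psd_kernel_matrix(K, tol=1e-10):
      eigenvalues = np.linalg.eigvalsh(K)   # K assumed symmetric
      return bool(np.all(eigenvalues >= -tol))

  X = np.random.randn(20, 3)
  K_linear = X @ X.T                        # Gram matrix of the linear kernel
  print(is_psd_kernel_matrix(K_linear))     # True for any data set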

6.2.2 Making Kernels from Kernels

Theorem 6.2 is the basic tool for verifying that a function is a kernel, and Proposition 6.3 gives the requirement for a finite set of points. This criterion for a finite set can now be applied to confirm that a number of new kernels can be created. The following proposition of Nello Cristianini and John Shawe-Taylor [Ne00] allows creating more complicated kernels from simple building blocks:

Proposition 6.4 Let K_1 and K_2 be kernels over X × X, X ⊆ Rⁿ, a ∈ R⁺, f(·) a real-valued function on X, Φ : X → R^m with K_3 a kernel over R^m × R^m, p(x) a polynomial with positive coefficients, and B a symmetric positive semi-definite n × n matrix. Then the following functions are kernels, too:

  K(x, z) = K_1(x, z) + K_2(x, z)
  K(x, z) = a K_1(x, z)
  K(x, z) = K_1(x, z) K_2(x, z)
  K(x, z) = f(x) f(z)
  K(x, z) = K_3(Φ(x), Φ(z))  (6.12)
  K(x, z) = xᵀBz
  K(x, z) = p(K_1(x, z))
  K(x, z) = exp(K_1(x, z))

6.2.3 Some well-known Kernels

The selection of a kernel function is an important problem in applications, although there is no theory telling which kernel to use when. Moreover, it can be very difficult to check that some particular kernel satisfies Mercer's conditions, since they must hold for every g satisfying (6.10). In the following, some well known and widely used kernels are presented. The selection of a kernel, perhaps from among those presented, is usually based on experience and knowledge about the classification problem at hand, as well as on theoretical considerations. The problem of choosing a kernel and its parameters on the basis of theoretical considerations will be discussed in chapter 7. Each kernel is explained below.

  Polynomial:  K(x, z) = (⟨x·z⟩ + c)^p  (6.13)

  Sigmoid:  K(x, z) = tanh(κ⟨x·z⟩ − δ)  (6.14)

  Radial Basis Function (Gaussian kernel):  K(x, z) = exp( −‖x − z‖² / (2σ²) )  (6.15)

Polynomial Kernel
Here p gives the degree of the polynomial and c is some non-negative constant, usually c = 1. The usage of another, generalized inner product instead of the standard inner product above has been proposed in many works on SVMs, because the Hessian matrix can become zero in numerical calculations (which means no solution for the optimization problem). The kernel then becomes

  K(x, z) = ( Σ_i x_i z_i / σ_i + c )^p

where the vector σ is chosen such that the function satisfies Mercer's condition.
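For reference, the kernels (6.13)-(6.15) written out as plain Python functions — a sketch; the parameter values are examples only, not recommendations:

  import numpy as np

  def poly_kernel(x, z, p=2, c=1.0):          # (6.13)
      return (np.dot(x, z) + c) ** p

  def sigmoid_kernel(x, z, kappa=0.5, delta=1.0):   # (6.14)
      return np.tanh(kappa * np.dot(x, z) - delta)

  def rbf_kernel(x, z, sigma=1.0):            # (6.15)
      d = x - z
      return np.exp(-np.dot(d, d) / (2.0 * sigma**2))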

Figure 6.6: A polynomial kernel of degree 2 used for the classification of the XOR data set (which is non-separable by a linear classifier in input space). Each colour represents one class and the dashed lines mark the margins. The level of shading indicates the functional margin; in other words, the darker the shading of the colour representing a specific class, the more confident the classifier is that a point in that region belongs to that class.

Sigmoid function
The sigmoid kernel stated above usually satisfies Mercer's condition only for certain values of the parameters κ and δ. This was noticed experimentally by Vapnik; currently there are no theoretical results on the parameter values that satisfy Mercer's conditions. As stated in [Pan01], the usage of the sigmoid kernel in an SVM can be regarded as a two-layer neural network. In such networks the input vector z is mapped by the first layer into the vector F = (F_1, …, F_N), where F_i = tanh(κ⟨z·x_i⟩ − δ), i = 1, …, N, and the dimension N of F is called the number of hidden units. In the second layer the sign of the weighted sum of the elements of F is calculated, using weights γ_i. Figure 6.7 illustrates this. The main difference to note between SVMs and two-layer neural networks is the optimization criterion: in the SVM case the goal is to find the optimal separating hyperplane which maximizes the margin (in feature space), while in a two-layer neural network the criterion is usually to minimize the empirical risk associated with some loss function, typically the mean squared error.

Figure 6.7: A 2-layer neural network with N hidden units. The outputs of the first layer are of the form F_i = tanh(κ⟨z·x_i⟩ − δ), i = 1, …, N, while the output of the whole network becomes ŷ = sgn( Σ_{i=1}^{N} γ_i F_i − b ).

Another important note should be given here: in neural networks the optimal network architecture is quite often unknown and is mostly found by experiments and/or prior knowledge, while in the SVM case such problems are avoided. Here the number of hidden units equals the number of support vectors, and the vector of weights in the output layer (γ) is determined automatically in the linearly separable case (in feature space).

Radial Basis Function (Gaussian)
The Gaussian kernel is also known as the Radial Basis Function. In the function (6.15) above, σ (the variance) defines a so-called window width (the width of the Gaussian). It is of course possible to have different window widths for different vectors, i.e. to use a vector σ (see [Cha00]). As some works show [Lin03], the RBF kernel is a good starting point for a first try if one knows nearly nothing about the data to classify. The main reasons will be stated in the upcoming chapter 7, where the parameter selection will also be discussed.

Figure 6.8: An SVM with a Gaussian kernel, a value of σ = 0.1 and the application of the maximal margin case (C = inf) on an artificially generated training set.

Another remark to mention here: up to now the algorithm, and thus the classifiers introduced above, are only intended for the binary case. But as we will see in chapter 8, this can easily be extended to the multiclass case.

6.3 Summary

Kernels are a very powerful tool when dealing with nonlinearly separable datasets. The kernel trick has long been known and has therefore been studied in detail. With its usage, the problem to solve stays the same as in the previous chapters, but the dot product in the formulas is rewritten using the implicit kernel mapping. So the problem can be stated as:

  Maximize  L_D(w, b, ξ, α, β) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j K(x_i; x_j)
  subject to  Σ_i y_i α_i = 0 ;  0 ≤ α_i ≤ C  (6.16)

with the same KT conditions as in the summary under 5.5.2. The overall decision function for some unseen data z then becomes:

  f(z, α*, b*) = sgn(⟨w*·z⟩ + b*) = sgn( Σ_i y_i α_i* K(x_i; z) + b* ) = sgn( Σ_{x_i ∈ SV} y_i α_i* K(x_i; z) + b* )  (6.17)

Note: This kernel representation will be used from now on; to give the link to the linear case of chapter 5, where K(x_i; x_j) is replaced by ⟨x_i·x_j⟩, this kernel will be called the Linear Kernel.

Chapter 7

Model Selection

As introduced in the last chapter, unless one builds one's own kernel based on knowledge about the problem at hand, it is intuitive to use the common and well known kernels as a first try. This approach is widely used, as the examples in appendix A will show. The first step is then the choice of which kernel to use to begin with; afterwards the penalty parameter C and the kernel parameters have to be chosen, too.

7.1 The RBF Kernel

As suggested in [Lin03], the RBF kernel is in general a reasonable first choice. If the problem at hand is nearly the same as an already formidably solved and well documented one (hand-written digit recognition, face recognition, …), a first try should of course be given to the kernels used there, although the parameters will mostly have to be chosen in other ranges applicable to the actual problem. Some examples of such already solved problems, and links to further reading about them, are given in appendix A.

As shown in the last chapter, the RBF kernel, like others, maps samples into a higher dimensional space, so — in contrast to the linear kernel — it can handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of the RBF one, as [Ke03] shows that the linear kernel with a penalty parameter C̃ has the same performance as the RBF kernel with some parameters (C, γ)⁵. In addition, the sigmoid kernel behaves like RBF for certain parameters [L03]. Another reason is the number of hyperparameters, which influence the complexity of model selection: the polynomial kernel has more of them than the RBF kernel.

⁵ γ = 1/(2σ²)

Finally, the RBF kernel has fewer numerical difficulties. One key point is 0 < K_ij ≤ 1, in contrast to polynomial kernels, whose kernel values may go towards infinity. Moreover, as said in the last chapter, the sigmoid kernel is not valid (i.e. not the inner product of two vectors) under some parameters.

7.2 Cross-Validation

In the case of RBF kernels there are two tuning parameters: C and σ. It is not known beforehand which values are best for the problem at hand, so some parameter search must be done to identify the optimal ones. "Optimal" means finding C and σ such that the classifier can accurately predict unknown data after training, i.e. testing data. Note that it is not useful to achieve a high training accuracy at the cost of generalization ability. Therefore a common way is to separate the training data into two parts, one of which is considered unknown when training the classifier. The prediction accuracy on this set then reflects the performance on classifying unknown data more precisely. An improved version of this technique is known as cross-validation.

In so-called k-fold cross-validation, the training set is divided into k subsets of equal size. Sequentially, each subset is tested using the classifier trained on the remaining k − 1 subsets. Thus each instance of the whole training set is predicted once, and the cross-validation accuracy is the percentage of data which is classified correctly. The main disadvantage of this procedure is its computational intensity, because the model has to be trained k times. A simpler technique can be derived from this model by choosing k-fold cross-validation with k = l: sequentially remove the i-th training example and train with the remaining ones. This procedure is known as Leave-One-Out (loo).

Another technique is known as Grid-Search; this approach has been chosen by [Lin03]. The main idea is to simply try pairs of (C, γ) and pick the one with the best cross-validation accuracy. Mentioned in this paper is the observation that trying exponentially growing sequences of C and γ is a practical way to find good parameters, e.g. C = 2⁻⁵, 2⁻³, …, 2¹⁵ and γ = 2⁻¹⁵, 2⁻¹³, …, 2³. Of course this search method is straightforward and naive in some ways; there are more advanced techniques for grid-searching, but they amount to an exhaustive parameter search by approximation or heuristics. Another reason is that it has been shown that the computational time to find good parameters with the plain grid-search is not much longer than with advanced methods, since there are still the same two parameters to optimize.
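A sketch of the described grid search with k-fold cross-validation follows; train_svm and accuracy are placeholders for whatever training and evaluation routine is used (e.g. the MATLAB implementation of this thesis) and are not defined here:

  import numpy as np

  def cross_val_accuracy(X, y, C, gamma, train_svm, accuracy, k=5):
      # k-fold cross-validation: train on k-1 folds, test on the held-out one
      folds = np.array_split(np.random.permutation(len(y)), k)
      scores = []
      for i in range(k):
          test = folds[i]
          train = np.concatenate([folds[j] for j in range(k) if j != i])
          model = train_svm(X[train], y[train], C=C, gamma=gamma)
          scores.append(accuracy(model, X[test], y[test]))
      return np.mean(scores)

  def grid_search(X, y, train_svm, accuracy):
      # exponentially growing sequences, as suggested in [Lin03]
      grid = [(2.0**i, 2.0**j) for i in range(-5, 16, 2)
                               for j in range(-15, 4, 2)]
      return max(grid, key=lambda p: cross_val_accuracy(
          X, y, p[0], p[1], train_svm, accuracy))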

Chapter 8

Multiclass Classification

Up to now the study has been limited to the two-class case, called the binary case, where only two classes of data have to be separated. However, in real-world problems there are in general n classes to deal with. The training set still consists of pairs (x_i, y_i), where x_i ∈ Rᵈ, but now y_i ∈ {1, …, n}, i = 1, …, l. The first straightforward idea is to reduce the multiclass problem to many two-class problems, so that each resulting class is separated from the remaining ones.

8.1 One-Versus-Rest (OVR)

As mentioned above, the first idea for constructing a multiclass classifier is the construction of n two-class classifiers with the following decision functions:

  f_k(z) = sgn(⟨w_k·z⟩ + b_k) ;  k = 1, …, n  (8.1)

This means that the classifier for class k separates this class from all other classes:

  f_k(x) = +1 if x belongs to class k, −1 otherwise

The step-by-step procedure starts with class one: construct the first binary classifier for class 1 (positive) versus all others (negative), then class 2 versus all others, …, up to class k = n versus all others. The resulting combined OVR decision function chooses for a sample the class that corresponds to the maximal value of the k binary decision functions (i.e. the furthest positive hyperplane). For clarification see figure 8.1 and table 8.1. This whole first approach to obtaining a multiclass classifier is computationally very expensive, because n quadratic programming (QP) optimization problems of size l (the training set size) have to be solved.

As an example, consider the three-class problem with linear kernel introduced in figure 8.1. The OVR method yields a decision surface divided by three separating hyperplanes (the dashed lines). The shaded regions in the figure correspond to tie situations, where two or no classifiers are active, i.e. vote positive at the same time (also see table 8.1).

Figure 8.1: OVR applied to a three-class (A, B, C) example with linear kernel.

Now consider the classification of a new unseen sample (the hexagon in figure 8.1) in the ambiguous region 3. This sample receives positive votes from both the A-class and the C-class binary classifiers. However, the distance of the sample from the A-class-versus-all hyperplane is larger than from the C-class-versus-all one; hence the sample is classified as belonging to class A. The ambiguous region 7, with no votes, is handled in the same way.
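This tie resolution amounts to taking the furthest positive hyperplane, i.e. an argmax over the n binary outputs. A minimal sketch — assuming, purely for illustration, that row k of W and entry b[k] describe the class-k-versus-rest hyperplane:

  import numpy as np

  # One-versus-rest decision: pick the class whose hyperplane output is largest
  def ovr_classify(z, W, b):
      scores = W @ z + b          # one value per binary classifier
      return int(np.argmax(scores))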

The final combined OVR decision function thus results in the decision surface separated by the solid line in figure 8.1. Notice, however, that this final decision function differs significantly from the original one, which corresponded to the solution of k (here 3) QP optimization problems: only three points (the black balls in figure 8.1) of the resulting borderlines coincide with the original ones calculated by the n Support Vector Machines. So it seems that the benefits of the maximal margin hyperplanes are lost. Summarized, this is the simplest multiclass SVM method [Krs99 and Stat].

  Region | A vs. B and C | B vs. A and C | C vs. A and B | Resulting class
  1      | −             | B             | C             | ?
  2      | −             | −             | C             | C
  3      | A             | −             | C             | ?
  4      | A             | −             | −             | A
  5      | A             | B             | −             | ?
  6      | −             | B             | −             | B
  7      | −             | −             | −             | ?

Table 8.1: Three binary OVR classifiers applied to the corresponding example (figure 8.1). The column "Resulting class" contains the resulting classification of each region. Cells with "?" correspond to tie situations, where two or no classifiers are active at the same time. See the text for how ties are resolved.

8.2 One-Versus-One (OVO)

The idea behind this approach is to construct a decision function f_km : Rᵈ → {+1, −1} for each pair of classes (k, m), 1 ≤ k < m ≤ n:

  f_km(x) = +1 if x belongs to class k, −1 if x belongs to class m

So in total there are n(n−1)/2 pairs, because this technique involves the construction of the standard binary classifier for all pairs of classes. In other words, for every pair of classes a binary SVM is solved, with the underlying optimization problem of maximizing the margin. The decision function then assigns an instance to the class which has the largest number of votes after the sample has been tested against all n(n−1)/2 decision functions.

So the classification now involves n(n−1)/2 comparisons, and in each one the class to which the sample belongs in that binary decision gets 1 added to its number of votes ("Max Wins" strategy). Of course there can still be tie situations; in such a case, the sample is assigned based on the classification provided by the furthest hyperplane, as in the OVR case [Krs99 and Stat]. As some researchers have proposed, this can be simplified by choosing the class with the lowest index when a tie occurs, because even then the results are still mostly accurate and well enough approximated [Lin03], without the additional computation of distances — but this has to be verified for the problem at hand.

The main benefit of this approach is that for every pair of classes the optimization problem to deal with is much smaller: in total only n(n−1)/2 QP problems of size smaller than l (the training set size) need to be solved, because only two classes are involved in each problem and not the whole training set, as in the OVR approach.

Again consider the three-class example from the previous chapter. Using the OVO technique with a linear kernel, the decision surface is divided by three separate hyperplanes (dashed lines) obtained by the binary SVMs (see figure 8.2). The application of the Max Wins strategy (see table 8.2) results in the division of the decision surface into three regions (separated by the thicker dashed lines) and the small shaded ambiguous region in the middle. After the tie-breaking strategy from above (furthest hyperplane) is applied to the ambiguous region 7 in the middle, the final decision function becomes the solid black lines together with the thicker dashed ones. Notice that here the final decision function does not differ significantly from the original one corresponding to the solution of the n(n−1)/2 optimization problems. So the main advantage over the OVR technique is the fact that the final borderlines are parts of the calculated pairwise decision functions, which was not the case in the OVR approach.

Figure 8.2: OVO applied to the three-class example (A, B, C) with linear kernel.

  Region | A vs. C | B vs. C | A vs. B | Resulting class
  1      | C       | C       | B       | C
  2      | C       | C       | A       | C
  3      | A       | C       | A       | A
  4      | A       | B       | A       | A
  5      | A       | B       | B       | B
  6      | C       | B       | B       | B
  7      | C       | B       | A       | ?

Table 8.2: Three binary OVO classifiers applied to the corresponding example (figure 8.2). The column "Resulting class" contains the resulting classification of each region according to the Max Wins strategy. The only cell with "?" corresponds to the tie situation where all three classifiers are active at the same time. See the text for how this tie is resolved.
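The Max Wins strategy of table 8.2 as a short sketch; pairwise_predict is a placeholder returning +1 if the (k, m) binary machine votes for class k, and −1 for class m:

  import numpy as np

  def ovo_classify(z, n_classes, pairwise_predict):
      votes = np.zeros(n_classes, dtype=int)
      for k in range(n_classes):
          for m in range(k + 1, n_classes):
              if pairwise_predict(k, m, z) > 0:
                  votes[k] += 1
              else:
                  votes[m] += 1
      # a remaining tie could be broken by hyperplane distance or lowest index
      return int(np.argmax(votes))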

8.3 Other Methods

The methods above are only some of those usable for multiclass SVMs, but they are the most intuitive ones. Other methods include, e.g., the usage of binary decision trees, which are nearly the same as the OVO method; for details see [Pcs00]. Another method was proposed by Weston and Watkins ("WW") [Stat and WeW98]. In this technique the n-class case is reduced to solving a single quadratic optimization problem of the new size (n−1)·l, which is identical to the binary SVM for the case n = 2. There exist some speed-up techniques for this optimization problem, called decomposition [Stat], but the main disadvantage is that the optimality of this method has not yet been proven. An extension of this was given by Crammer and Singer ("CS"). There, the same problem as in the WW approach has to be solved, but they managed to reduce the number of slack variables in the constraints of the optimization problem, which makes it computationally cheaper. Here, too, there exist decomposition techniques for speed-up [Stat], but unfortunately, as above, the optimality has not been demonstrated yet.

But which method is suitable for a given problem? As shown in [WeW98] and other papers, the optimal technique is mostly the WW approach; it has shown the best results in comparison to OVO, OVR and the binary decision trees. But as this method is not yet proven to be optimal, and some reformulation of the problem is needed, it is not easy to implement. As a good compromise the OVO method can be chosen: it is the one mainly used by the current implementations and has been shown to produce good results [Lin03]. Vapnik himself has used the OVR method, which is mainly attributed to the smaller computational steps: in the OVR case only n hyperplanes have to be constructed, one for each class, while in the OVO case there are n(n−1)/2 to compute. So the use of the OVR technique decreases the computational effort by a factor of (n−1)/2. The main advantage compared with the WW method is that in OVR (as in OVO) one is able to choose different kernels for each separation, which is not possible in the WW case, because it is a joint computation [Vap98].

Part III

Implementation

Chapter 9

Implementation Techniques

In the previous chapters it was shown that the training of Support Vector Machines can be reduced to maximizing a convex quadratic function subject to linear constraints (see chapter 5.5.1). Such convex quadratic functions have only one local maximum (the global one), and their solution can always be found efficiently. Furthermore, the dual representation of the problem showed how the training can be performed successfully even in very high dimensional feature spaces. The problem of minimizing differentiable functions of many variables has been widely studied, especially in the convex case, and most of the standard techniques can be directly applied to SVM training. However, there exist specific techniques that exploit particular features of this problem. For example, the large size of the training set is a formidable obstacle to a direct use of standard techniques, since just storing the kernel matrix requires a memory space that grows quadratically with the sample size.

9.1 General Techniques

A number of optimization techniques have been devised over the years, and many of them can be applied directly to quadratic programs — for example the Newton method, conjugate gradient, or primal-dual interior-point methods. They can be applied to the case of Support Vector Machines straightforwardly. Not only that, they can also be considerably simplified, because the specific structure of the objective function can be exploited. Conceptually they are not very different from the simple gradient ascent⁶ strategy known from neural networks. But many of these techniques require that the kernel matrix is stored completely in memory. The quadratic form in (5.18) involves a matrix with a number of elements equal to the square of the number of training examples; this matrix cannot fit, e.g., into a memory of 128 Megabytes if there are more than 4000 training examples (assuming each element is stored as an 8-byte double precision number). So for large-size problems the approaches described above can be inefficient or even impossible to apply.

⁶ For an adaptation to SVMs, see [Ne00]

They are therefore used in conjunction with the so-called decomposition techniques ("chunking" and "decomposition"; for an explanation see [Ne00]). The main idea behind these methods is to iteratively optimize only a small subset of the problem in each iteration. The main advantage of such techniques is that they are well understood and widely available in a number of commercial and freeware packages; these were mainly used for Support Vector Machines before special algorithms were developed. The most common were, for example, the MINOS package from the Stanford Optimization Laboratory (a hybrid strategy) and the LOQO package (a primal-dual interior-point method). In contrast to these, the quadratic program subroutine qp provided in the MATLAB optimization toolbox is very general, but the routine quadprog is significantly better than qp.

9.2 Sequential Minimal Optimization (SMO)

The algorithm used in nearly every implementation of SVMs in a slightly changed manner — and in the one of this diploma thesis, too — is the SMO algorithm. It was developed by John C. Platt [Pla00], and its main advantage, besides being one of the most competitive, is the fact that it is simple to implement. The idea behind this algorithm is derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points in each iteration. The power of this approach resides in the fact that the optimization problem for two data points admits an analytical solution, eliminating the need to use an iterative quadratic program optimizer as part of the algorithm.

So SMO breaks the large QP problem into a series of smallest possible QP problems and solves them analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required by SMO is therefore linear in the training set size — no longer quadratic — which allows SMO to handle very large training sets. The computation time of SMO is mainly dominated by SVM evaluation, as will be seen below. The smallest possible subset for optimization involves two Lagrange multipliers, because the multipliers must obey the linear equality constraint (of 5.20) Σ_i y_i α_i = 0, and therefore, when updating one multiplier α_k, at least one other multiplier α_p (k ≠ p and 1 ≤ k, p ≤ l) has to be adjusted in order to keep the condition true.

At every step SMO chooses two Lagrange multipliers to optimize jointly, finds the optimal values for them, and updates the SVM to reflect the new optimal values. So the advantage of SMO, to repeat it again, lies in the fact that solving for two Lagrange multipliers can be done analytically; an entire inner iteration of numerical QP optimization is thus avoided. Even though more optimization sub-problems are solved, each sub-problem is solved so quickly that the overall QP problem is solved fast (a comparison of the most commonly used methods can be found in [Pla00]). In addition, SMO requires no extra matrix storage (ignoring the minor amounts of memory needed for the small matrices SMO uses). Thus very large SVM training problems can fit inside the memory of an ordinary personal computer.

The SMO algorithm mainly consists of three components:

  - an analytic method to solve for the two Lagrange multipliers,
  - a heuristic for choosing which multipliers to optimize,
  - a method for computing the bias b.

As already mentioned in chapter 5.2.2, the computation of the bias b can be problematic when simply taking the average value of b after summing up all b's calculated for each α_i. This was shown by [Ker01]. The main problem arising when using an averaged value of the bias for recalculation in the SMO algorithm is that its convergence speed is not guaranteed — sometimes it is slower and sometimes faster. So Keerthi suggested an improvement of the SMO algorithm where two threshold values b_up and b_low are used instead of one. It has been shown in this paper that the modified SMO algorithm is more efficient than the original one on all tested datasets; the speed-up is significant! But as a first introduction the original SMO algorithm is used here, and it can be extended later.

Before continuing, one disadvantage of the SMO algorithm should be stated. In its original form, implemented in nearly every toolbox, it cannot handle the 2-norm case, because the KT conditions differ, as can be seen in chapter 5.5.2. Therefore nearly every toolbox that wants to implement the 2-norm case uses the optimization techniques mentioned above. Only one implements the 1- and 2-norm cases at the same time, with an extended form of the SMO algorithm (LibSVM by Chih-Jen Lin). The 2-norm case will also be added to the SMO algorithm developed in this diploma thesis.

As will be seen, SMO spends most of its time evaluating the decision function rather than performing QP, so it can exploit data sets which contain a substantial number of zero elements. Such sets will be called sparse.

9.2.1 Solving for the two Lagrange Multipliers

First recall the general mathematically formulated problem:

  Maximize  L_D(w, b, ξ, α, β) = W(α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j K(x_i; x_j)
  subject to  Σ_i y_i α_i = 0 ;  0 ≤ α_i ≤ C

with the following KT conditions fulfilled for all i if the QP problem is solved (for maximal margin and 1-norm):

  α_i = 0  ⇒  y_i (⟨w*·x_i⟩ + b*) ≥ 1
  0 < α_i < C  ⇒  y_i (⟨w*·x_i⟩ + b*) = 1
  α_i = C  ⇒  y_i (⟨w*·x_i⟩ + b*) ≤ 1

For convenience, all quantities referring to the first multiplier will have a subscript 1 and those referring to the second a subscript 2. Without the additional superscript "old" they are meant to be the just-optimized values ("new"). For initialization, α is set to the zero vector.

In order to take a step towards the overall solution, two α's are picked; SMO calculates the constraints on these two multipliers and then solves for the constrained maximum. Because there are only two variables now, the constraints can easily be displayed in two dimensions (see figure 9.1). The constraints 0 ≤ α_1, α_2 ≤ C cause the Lagrange multipliers to lie inside a box, while the linear equality constraint Σ_i y_i α_i = 0 causes them to lie on a diagonal line. Thus the constrained maximum of the objective function W(α) must lie on a diagonal line segment (explanation in figure 9.1 and on the following pages). In other words, to not violate the linear constraint, the two multipliers must fulfil

  y_1 α_1 + y_2 α_2 = const. = y_1 α_1^old + y_2 α_2^old  (lie on a line)

inside the box constrained by 0 ≤ α_1, α_2 ≤ C. The one-dimensional problem resulting from the restriction of the objective function to such a line can be solved analytically.

Figure 9.1: The two cases of optimization: y_1 ≠ y_2 and y_1 = y_2. The two Lagrange multipliers chosen for the subset optimization must fulfil all constraints of the full problem. The inequality constraints cause them to lie inside a box, while the linear equality constraint causes them to lie on a diagonal line. Therefore one step of SMO must find the optimum of the objective function on a diagonal line segment. In this figure γ = α_1 + m α_2 = α_1^old + m α_2^old, a constant that depends on the previous values of α_1 and α_2, and m = y_1 y_2 ∈ {+1, −1}.

Without loss of generality, the algorithm first computes the second multiplier α_2^new and expresses the ends of the diagonal line segment in terms of it; α_2^new is subsequently used to obtain α_1^new. The bounds on the new multiplier can be formulated more restrictively using the box constraint and the equality constraint (also see figure 9.1). First recall for each i: 0 ≤ α_i ≤ C, and that the linear constraint Σ_i y_i α_i = 0 has to hold. Using the two actual multipliers to be optimized, we write y_1 α_1 + y_2 α_2 = const., and therefore α_1 + m α_2 = γ, where γ = α_1^old + m α_2^old. There are two cases to consider (remember m ∈ {−1, +1}):

Figure 9.2: Case 1: y_1 ≠ y_2, i.e. m = −1 and α_1 − α_2 = γ. Shown are the two lines indicating the cases γ > 0 and γ < 0.

  Case 1: y_1 ≠ y_2, then α_1 − α_2 = γ  (9.1)
  Case 2: y_1 = y_2, then α_1 + α_2 = γ  (9.2)

With m = y_1 y_2 the two equations above can be written as one:

  α_1 + m α_2 = γ  (9.3)

where γ = α_1^old + m α_2^old before the optimization.

The end points of the searched diagonal line segment (figures 9.2 and 9.3) can then be expressed with the help of the old, not yet optimized values:

Case 1: α_1^old − α_2^old = γ
  L (α_2 at the lower end point) is: L = max(0, −γ) = max(0, α_2^old − α_1^old)
  H (α_2 at the higher end point) is: H = min(C, C − γ) = min(C, C + α_2^old − α_1^old)

Figure 9.3: Case 2: y_1 = y_2, i.e. m = +1 and α_1 + α_2 = γ. Shown are the two lines indicating the cases γ > C and γ < C.

Case 2: α_1^old + α_2^old = γ
  L (α_2 at the lower end point) is: L = max(0, γ − C) = max(0, α_1^old + α_2^old − C)
  H (α_2 at the higher end point) is: H = min(C, γ) = min(C, α_1^old + α_2^old)

As a summary, the bounds on α_2^new are L ≤ α_2^new ≤ H, where

  if y_1 ≠ y_2:  L = max(0, α_2^old − α_1^old),  H = min(C, C + α_2^old − α_1^old)
  if y_1 = y_2:  L = max(0, α_1^old + α_2^old − C),  H = min(C, α_1^old + α_2^old)  (9.4)
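Equation (9.4) translates directly into code; a sketch for the 1-norm case, with a1 and a2 denoting the old values of α_1 and α_2:

  def bounds(a1, a2, y1, y2, C):
      # feasible interval [L, H] for the second multiplier, equation (9.4)
      if y1 != y2:
          L = max(0.0, a2 - a1)
          H = min(C, C + a2 - a1)
      else:
          L = max(0.0, a1 + a2 - C)
          H = min(C, a1 + a2)
      return L, H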

At first glance this only appears to be applicable to the 1-norm case, but treating C as infinite for the hard margin case reduces the constraints on the interval [L, H] to:

  if y_1 ≠ y_2:  L = max(0, α_2^old − α_1^old), only lower bounded (H = ∞)
  if y_1 = y_2:  L = 0,  H = α_1^old + α_2^old

Now, with the other α's assumed fixed, the objective function W(α_1, α_2) = L_D can be rewritten. For simplification assume the following substitutions (K(x_i; x_j) is abbreviated as K_ij):

  K_11, K_12, K_22  and  v_i = Σ_{j=3}^{l} y_j α_j K_ij

As in figure 9.1, let m = y_1 y_2; with the equality constraint we get y_1 α_1 + y_2 α_2 = const., which, multiplied by y_1, leads to

  α_1 = γ − m α_2,  where γ = α_1^old + m α_2^old.

Resubstituting all these relations back into L_D, the formula becomes:

  L_D(α_1, α_2) = α_1 + α_2 − ½ K_11 α_1² − ½ K_22 α_2² − m K_12 α_1 α_2 − y_1 v_1 α_1 − y_2 v_2 α_2 + const.

where const. collects the terms of the multipliers not optimized in this step, const. = Σ_{j=3}^{l} α_j − ½ Σ_{i,j=3}^{l} y_i y_j α_i α_j K_ij, which are regarded as constant values that are simply added. Using γ, L_D becomes a function W depending only on α_2:

  W(α_2) = γ − m α_2 + α_2 − ½ K_11 (γ − m α_2)² − ½ K_22 α_2² − m K_12 (γ − m α_2) α_2 − y_1 v_1 (γ − m α_2) − y_2 v_2 α_2 + const.

To find the maximum of this function, the first and the second derivative of W with respect to α_2 are needed:

  ∂W/∂α_2 = 1 − m + m γ (K_11 − K_12) + α_2 (2 K_12 − K_11 − K_22) + y_2 (v_1 − v_2)
  ∂²W/∂α_2² = 2 K_12 − K_11 − K_22 =: η  (9.5)

The following new notation will simplify the statements: f(x) is the current hypothesis function, determined by the values of the actual vector α and the bias b at a particular stage of learning; within this chapter its output is computed as u = f(x) = Σ_j y_j α_j K(x_j; x) − b. The newly introduced quantity

  E_i = u_i − y_i

is the difference between the function output (the classification by the machine trained so far) and the target classification (given by the supervisor in the training set) on the training point x_1 or x_2 — meaning it is the training error on the i-th example.

This value may be large even if a point is classified correctly. As an example, if y_1 = 1 and the function output is f(x_1) = 5, the classification is correct, but E_1 = 4.

Recalling the substitution v_i = Σ_{j=3}^{l} y_j α_j K_ij, u_1 can be written as

  u_1 = y_1 α_1^old K_11 + y_2 α_2^old K_12 + Σ_{j=3}^{l} y_j α_j K_1j − b

and so

  v_1 = u_1 + b − y_1 α_1^old K_11 − y_2 α_2^old K_12  (9.6)

and, analogously,

  v_2 = u_2 + b − y_1 α_1^old K_12 − y_2 α_2^old K_22  (9.7)

At the maximum the first derivative is zero (and the second one has to be negative). Hence

  α_2 (K_11 + K_22 − 2 K_12) = 1 − m + m γ (K_11 − K_12) + y_2 (v_1 − v_2)

and with equations (9.6) and (9.7) this becomes (remember m = y_1 y_2 and γ = α_1^old + m α_2^old):

  α_2^new (K_11 + K_22 − 2 K_12) = α_2^old (K_11 + K_22 − 2 K_12) + y_2 (u_1 − u_2 + y_2 − y_1) = −η α_2^old + y_2 (E_1 − E_2)

so that the new multiplier can be expressed as:

  α_2^new = α_2^old − y_2 (E_1 − E_2) / η  (9.8)

This is the unconstrained maximum, so it has to be constrained to lie within the ends of the diagonal line segment, meaning L ≤ α_2^new ≤ H (see figure 9.1):

  α_2^new,clipped = H if α_2^new ≥ H ;  α_2^new if L < α_2^new < H ;  L if α_2^new ≤ L  (9.9)

The value of α_1^new is obtained from the equality y_1 α_1^new + y_2 α_2^new,clipped = y_1 α_1^old + y_2 α_2^old, and therefore

  α_1^new = α_1^old + m (α_2^old − α_2^new,clipped)  (9.10)

As stated above, the second derivative η has to be negative to ensure a maximum. But under unusual circumstances it will not be negative: a zero second derivative can occur if more than one training example has the same input vector x.
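The complete analytic step (9.5), (9.8)-(9.10) as a compact sketch; the degenerate case η ≥ 0 is only flagged here, its treatment is described below:

  def smo_step(a1, a2, y1, y2, E1, E2, K11, K12, K22, L, H):
      eta = 2.0 * K12 - K11 - K22        # second derivative (9.5), usually < 0
      if eta >= 0:
          return None                    # degenerate case, handled separately
      a2_new = a2 - y2 * (E1 - E2) / eta # unconstrained maximum (9.8)
      a2_new = min(max(a2_new, L), H)    # clip to the line segment (9.9)
      a1_new = a1 + y1 * y2 * (a2 - a2_new)   # keep y1*a1 + y2*a2 constant (9.10)
      return a1_new, a2_new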

In any event, SMO will work even if the second derivative is not negative; in that case the objective function W should be evaluated at each end of the line segment, and SMO uses the Lagrange multipliers at the end point which yields the higher value of the objective function. These circumstances are regarded and solved in the next subchapter about choosing the Lagrange multipliers to be optimized.

9.2.2 Heuristics for choosing which Lagrange Multipliers to optimize

The SMO algorithm is based on the evaluation of the KT conditions, because when every multiplier fulfils these conditions of the problem, the solution is found. The KT conditions are normally verified up to a certain tolerance level ε. As Platt mentioned in his paper, the value of ε is typically in the range of 10⁻³ to 10⁻², implying, e.g., that outputs on the positive (+1) margin may lie between 0.999 and 1.001. Normally this tolerance is enough when using an SVM for recognition; demanding higher accuracy, the algorithm will not converge very fast.

There are two heuristics used for choosing the two multipliers to optimize. The first choice heuristic, for α_1^old, provides the outer loop of the SMO algorithm. This loop first iterates over the entire training set, determining whether an example violates the KT conditions. If it does, this example is immediately chosen for optimization. The second example, and therefore the candidate for α_2^old, is found by the second choice heuristic, and the two multipliers are then optimized jointly. At the end of this optimization the SVM is updated, and the algorithm resumes iterating over the training examples looking for KT violators. To speed up the training, the outer loop does not always iterate over the entire training set: after one pass through the training set, it iterates only over those examples whose Lagrange multipliers are neither 0 nor C (the non-bound examples). Again, each example is checked against the KT conditions, and violating ones are chosen for immediate optimization and update. The outer loop makes repeated passes over the non-bound examples until all of them obey the KT conditions within the tolerance level ε; then it iterates over the whole training set again to find violators. So, all in all, the outer loop keeps alternating between single passes over the whole training set and multiple passes over the non-bound subset until the entire set obeys the KT conditions within the tolerance level ε, at which point the algorithm terminates.

Once the first Lagrange multiplier to be optimized is chosen, the second one has to be found. The heuristic for this one is based on maximizing the step that can be taken during the joint optimization. Evaluating the kernel function for this purpose would be time-consuming, so SMO approximates the step size via equation (9.8): the maximal possible step size is the one having the biggest value of |E_1 − E_2|.

To speed this up, a cached error value E is kept for every non-bound example, from which SMO chooses the one that approximately maximizes the step size: if E_1 is positive, the example with the minimal error E_2 is chosen; if E_1 is negative, the example with the maximal error E_2 is chosen.

Under unusual circumstances, like those remarked at the end of the last subchapter (two identical training vectors), SMO cannot make positive progress using this second choice heuristic. To avoid this, SMO uses a hierarchy of second choice heuristics until it finds a pair of multipliers making positive progress. If there is no positive progress using the above approximation, the algorithm starts iterating through the non-bound examples at a random position. If none of them makes positive progress, the algorithm starts iterating through the entire training set at a random position, to find a suitable multiplier α_2^old that will make positive progress in the joint optimization. The randomness in choosing the starting position is used to avoid a bias towards examples stored at the beginning of the training set. In very extreme, degenerate cases a second multiplier making positive progress cannot be found at all; in such cases the first multiplier is skipped and a new one is chosen.

9.2.3 Updating the threshold b and the Error Cache

Since solving for the Lagrange multipliers does not determine the threshold b of the SVM, and since the value of the error cache E_i must be updated at the end of each optimization step, the value of b has to be re-evaluated after each optimization, so that the KT conditions are fulfilled for both optimized examples. Now let u_1 be the output of the SVM with the old α_1 and α_2:

  u_1 = y_1 α_1^old K_11 + y_2 α_2^old K_12 + Σ_{j=3}^{l} y_j α_j K_1j − b^old  (9.11)

  E_1 = u_1 − y_1  (9.12)

As in figure 9.4, if the new α_1 is not at the bounds, the output of the SVM after the optimization on example x_1 will be y_1, its label value, and therefore:

  y_1 = y_1 α_1^new K_11 + y_2 α_2^new,clipped K_12 + Σ_{j=3}^{l} y_j α_j K_1j − b^new  (9.13)

Substituting (9.13) and (9.11) into (9.12):

$$b_1^{new} = E_1 + y_1(\alpha_1^{new} - \alpha_1^{old})K_{11} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{12} + b^{old} \qquad (9.14)$$

Similarly, an equation for $b_2$ is obtained, such that the output of the SVM after the optimization is $y_2$ when $\alpha_2$ is not at the bounds:

$$b_2^{new} = E_2 + y_1(\alpha_1^{new} - \alpha_1^{old})K_{12} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{22} + b^{old} \qquad (9.15)$$

When both $b_1$ and $b_2$ are valid, they are equal (see figure 9.4 again). When both newly calculated Lagrange multipliers are at the bound and if L is not equal to H, then the whole interval $[b_1, b_2]$ describes a threshold consistent with the KT conditions. SMO then chooses the midpoint, $b^{new} = (b_1 + b_2)/2$. This formula is only valid if b is subtracted from the weighted sum of the kernels, not added. If one multiplier is at the bound and the other one is not, then the value of b calculated using the non-bound multiplier is used as the new updated threshold. As mentioned above, this step is regarded as problematic by [Ker01]. But to avoid it, the original SMO algorithm discussed here has to be modified as a whole, and therefore only a reference to the improved algorithm is given here. The modified pseudo code will be stated together with the original one in the appendix.

As seen in the former chapter, a cached error value E is kept for every example whose Lagrange multiplier is neither zero nor C (non-bound). So if a Lagrange multiplier is non-bound after being optimized, its cached error is zero (it is classified correctly). Whenever a joint optimization occurs, the stored errors of the other multipliers not involved have to be updated using the following equation:

$$E_i^{new} = E_i^{old} + u_i^{new} - u_i^{old}$$

which, re-substituted, becomes:

$$E_i^{new} = E_i^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})K_{1i} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{2i} + b^{old} - b^{new} \qquad (9.16)$$
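This joint update of the threshold and the error cache can be written down compactly; a minimal MATLAB sketch (illustrative, not the thesis code; K is a precomputed kernel matrix, E the error cache as a column vector, i1 and i2 the two optimized examples):

% Threshold and error cache update after jointly optimizing alpha_i1, alpha_i2
d1 = y(i1)*(a1new - alphas(i1));                 % y1*(alpha1_new - alpha1_old)
d2 = y(i2)*(a2new - alphas(i2));                 % y2*(alpha2_new,clipped - alpha2_old)
b1 = E(i1) + d1*K(i1,i1) + d2*K(i1,i2) + b;      % eq. (9.14)
b2 = E(i2) + d1*K(i1,i2) + d2*K(i2,i2) + b;      % eq. (9.15)
if a1new > 0 && a1new < C
    bnew = b1;                                   % alpha1 non-bound
elseif a2new > 0 && a2new < C
    bnew = b2;                                   % alpha2 non-bound
else
    bnew = (b1 + b2)/2;                          % both at a bound
end
E = E + d1*K(:,i1) + d2*K(:,i2) + b - bnew;      % eq. (9.16) for all examples
E([i1 i2]) = 0;                                  % exact for the optimized non-bound multipliers
alphas(i1) = a1new;  alphas(i2) = a2new;  b = bnew;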

Figure 9.4: Threshold b when both α's are at a bound (e.g. α = C). The support vectors A and B give the same threshold b, that is the distance of the optimal separating hyperplane from the origin. Points D and E give $b_1$ and $b_2$ respectively; they are error points within the margin. The searched b is somewhere between $b_1$ and $b_2$.

Overall, when an error value E is required by the SMO algorithm, it will look it up in the error cache if the corresponding Lagrange multiplier is not at a bound. Otherwise, it will evaluate the current SVM decision function (classify the given point and compare it to the given label) based on the current α's.

9.2.4 Speeding up SMO

There are certain points in the SMO algorithm where some useful techniques can be considered to speed up the calculation. As said in the summary on linear SVMs, it is possible there to store the weight vector directly, rather than all of the training examples that correspond to non-zero Lagrange multipliers. This optimization is only possible for the linear kernel. After the joint optimization succeeded, the stored weight vector must be updated to reflect the new Lagrange multipliers found. This update is easy, due to the linearity of the SVM:

$$w^{new} = w^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})x_1 + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})x_2$$
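In MATLAB this update is essentially a one-liner (a sketch, assuming the training examples are stored as rows of X and w is a column vector; alphas still holds the old values here):

% Weight vector update for the linear kernel after a joint optimization
w = w + y(i1)*(a1new - alphas(i1))*X(i1,:)' + y(i2)*(a2new - alphas(i2))*X(i2,:)';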

This is a speed-up because much of the computation time in SMO is spent evaluating the decision function, and therefore speeding up the decision function speeds up SMO. Another optimization that can be made is using the sparseness of the input vectors. Normally, an input vector is stored as a vector of floating point numbers. A sparse input vector (with zeros in it) is stored by means of two arrays: id and val. The id array is an integer array storing the locations of the non-zero inputs, while the val array is a floating point array storing the corresponding non-zero values. The frequently used computation of the dot product between two such stored vectors (id1, val1, length num1) and (id2, val2, length num2) can then be done quickly, as shown in the pseudo code below:

p1 = 0, p2 = 0, dot = 0
while (p1 < num1 && p2 < num2) {
    a1 = id1[p1], a2 = id2[p2]
    if (a1 == a2) {                  // both vectors have a non-zero entry here
        dot += val1[p1]*val2[p2]
        p1++, p2++
    }
    else if (a1 > a2) p2++           // advance the pointer that is behind
    else p1++
}

This can be used to calculate linear and polynomial kernels directly. Gaussian kernels can also use this optimization through the usage of the following identity:

$$\|x - z\|^2 = x \cdot x - 2\,(x \cdot z) + z \cdot z$$

To speed up the Gaussian case even more, for every input the dot product with itself can be pre-computed. Another optimization technique for linear SVMs regards the weight vector again. Because it is not stored as a sparse array, the dot product of the weight vector with a sparse input vector (id, val) can be expressed as:

$$w \cdot x = \sum_{j=0}^{num-1} w[id[j]] \cdot val[j]$$

And for binary inputs storing the val array is not even necessary, since it is always 1. The dot product calculation in the pseudo code above therefore becomes a simple increment, and for a linear SVM the dot product of the weight vector with a sparse input vector becomes:

$$w \cdot x = \sum_{j=0}^{num-1} w[id[j]]$$

As mentioned in Platt's paper, there are more speed-up techniques that can be used, but they will not be discussed in detail here.

9.2.5 The improved SMO algorithm by Keerthi

In his paper [Ker01], Keerthi points out some difficulties encountered in the original SMO algorithm through the explicit use of the threshold b for checking the KT conditions. His modified algorithm will be stated here as pseudo code with a little explanation; for further details please refer to Keerthi's paper. Keerthi uses some new notations. Define

$$F_i = w \cdot x_i - y_i = \sum_{j=1}^{\ell} \alpha_j y_j K(x_j, x_i) - y_i$$

Now the KT conditions can be expressed as:

$$\alpha_i = 0 \;\Rightarrow\; y_i (F_i - b) \ge 0 \qquad 0 < \alpha_i < C \;\Rightarrow\; y_i (F_i - b) = 0 \qquad \alpha_i = C \;\Rightarrow\; y_i (F_i - b) \le 0$$

and these can be written as:

$$F_i \ge b \;\; \text{for } i \in I_0 \cup I_1 \cup I_2 \qquad \text{and} \qquad F_i \le b \;\; \text{for } i \in I_0 \cup I_3 \cup I_4$$

where

$$I_0 = \{i : 0 < \alpha_i < C\}, \quad I_1 = \{i : y_i = +1,\ \alpha_i = 0\}, \quad I_2 = \{i : y_i = -1,\ \alpha_i = C\}, \quad I_3 = \{i : y_i = +1,\ \alpha_i = C\}, \quad I_4 = \{i : y_i = -1,\ \alpha_i = 0\}$$

And now, to check whether the KT conditions hold, Keerthi also defines:

$$b_{up} = \min\{F_i : i \in I_0 \cup I_1 \cup I_2\} = F_{i\_up} \qquad (A)$$

$$b_{low} = \max\{F_i : i \in I_0 \cup I_3 \cup I_4\} = F_{i\_low} \qquad (B)$$

(the labels (A) and (B) refer to the corresponding places in the pseudo code in the appendix). The KT conditions then imply $b_{up} \ge b_{low}$, and similarly $F_i \ge b_{low}$ for $i \in I_0 \cup I_1 \cup I_2$ and $F_i \le b_{up}$ for $i \in I_0 \cup I_3 \cup I_4$. These comparisons do not use the threshold b! As an added benefit, given the first multiplier, these comparisons automatically find the second multiplier for the joint optimization. The pseudo code, as it can be found in Keerthi's paper, is given in appendix D.

As seen in the pseudo code and in Keerthi's paper, there are two modifications of the SMO algorithm. Both were tested in the paper on different datasets and showed a significant speed-up in contrast to the original SMO algorithm by Platt. They also overcome the problem that arises when only a single threshold is used (an example of where such problems arise can also be found in Keerthi's paper). As a conclusion from all tests, Keerthi showed that the second modification fares better than the original SMO overall.
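The resulting dual-threshold optimality check is compact; a minimal MATLAB sketch (illustrative variable names, not the thesis code; F is the vector of the $F_i$ values and tol the KT tolerance):

% Optimality check via b_up/b_low instead of a single threshold b
Iup  = (alphas > 0 & alphas < C) | (y == +1 & alphas == 0) | (y == -1 & alphas == C);
Ilow = (alphas > 0 & alphas < C) | (y == +1 & alphas == C) | (y == -1 & alphas == 0);
bup  = min(F(Iup));                  % (A): smallest F_i over I0, I1, I2
blow = max(F(Ilow));                 % (B): largest  F_i over I0, I3, I4
optimal = (blow <= bup + 2*tol);     % KT conditions hold within the tolerance
% the indices attaining blow and bup form the maximally violating pair,
% i.e. the natural candidates for the next joint optimization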

9.2.6 SMO and the 2-norm case

As stated before, the SMO algorithm is not able to handle the 2-norm case without altering the code. Recall that there are two differences to the maximal margin and the 1-norm case: first the addition of 1/C to the diagonal of the kernel matrix, and second the altered KT conditions, which are used in SMO as the stopping criterion:

$$\alpha_i = 0 \;\Rightarrow\; y_i(\langle w^*, x_i \rangle + b^*) \ge 1 \qquad 0 < \alpha_i \;\Rightarrow\; y_i(\langle w^*, x_i \rangle + b^*) = 1 - \frac{\alpha_i}{C}$$

As the original SMO algorithm tests the KT conditions only in the outer loop when selecting the first multiplier to optimize, this is the point to alter. Also the kernel evaluation has to be extended to add the diagonal values. In the pseudo code above, the checking of the KT conditions is processed by:

E = SVM output on point[i] - y[i]    (check in error cache)
r = E*y[i]
if ((r < -tol && alpha[i] < C) || (r > tol && alpha[i] > 0))

where r equals $y_i f(x_i) - 1$. So the KT conditions are tested against > 0 and < 0, where 0 is replaced by the tolerance tol. For the 2-norm case the test is rewritten as:

E = SVM output on point[i] - y[i]    (check in error cache)
r = E*y[i] + alpha[i]/C
if ((r < -tol) || (r > tol && alpha[i] > 0))

Second, the box constraint on the multipliers has to be removed, as in the maximal margin case, because they are no longer upper bounded by C. And last but not least, the bias has to be calculated only using alphas fulfilling $0 < \alpha_i$, i.e. $y_i(\langle w^*, x_i \rangle + b^*) = 1 - \alpha_i/C$.

9.3 Data Pre-processing

As one can read in [Lin03], there are some propositions on the handling of the data to be used.

9.3.1 Categorical Features

SVMs require that each data instance is represented as a vector of real numbers. Hence, if there are categorical attributes, they first have to be converted into numeric data. Cheng recommends using m numbers to represent an m-category attribute. Then only one of the m numbers is one, and the others are zero. Consider the three-category attribute {red, green, blue}, which can then be represented as (0,0,1), (0,1,0) and (1,0,0). Cheng's experience indicates that if the number of values of an attribute is not too large, this coding might be more stable than using a single number to represent a categorical attribute.
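Such a 1-of-m coding can be produced mechanically; a small MATLAB sketch (assuming the categories are coded as integers 1..m; purely illustrative):

% 1-of-m coding: category c in {1..m} -> unit vector of length m
m = 3; c = 3;                % e.g. {red=1, green=2, blue=3}, here 'blue'
v = zeros(1, m);
v(c) = 1;                    % yields (0,0,1) for 'blue'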

9.3.2 Scaling

Scaling the data before applying it to an SVM is very important. [Lin03] explains why scaling is so important, and most of these considerations also apply to SVMs. The main advantage is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation: because kernel values usually depend on the inner products of feature vectors, large attribute values may cause numerical problems. So Cheng recommends linearly scaling each attribute to the range [-1, +1] or [0, 1]. In the same way, the testing data then has to be scaled before testing it on the trained machine. In this diploma thesis the most common scaling, to [-1, +1], is used. The according formula scales the components of an input $x = (x_1, \dots, x_n)^T$ linearly to the interval [-1, +1] of length two by applying

$$x_{i,scal} = 2\,\frac{x_i - x_{i,min}}{x_{i,max} - x_{i,min}} - 1, \qquad i \in \{1, 2, \dots, n\}$$

The scaling has to be done for each feature separately, so the min and max values are taken with regard to the current feature across all vectors. To go into detail, the reason for doing this is as follows: imagine a vector of two features (2-dimensional); the first has a value of 5, the second one lies in a much higher numeric range, and assume the other vectors behave the same way. Then the first feature would not have a very great impact on distinguishing between the classes, because a change in feature one is numerically very small in contrast to a change in feature two, whose numbers are in a much higher range. Other long-studied methods for scaling the data that show very good results use the covariance matrix from Gaussian theory.
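A minimal MATLAB sketch of this per-feature scaling (illustrative; the thesis' own implementation is the linscale function listed in chapter 10):

% Linearly scale every feature (column of X) to the interval [-1, +1]
mn = min(X, [], 1);                              % per-feature minima
mx = max(X, [], 1);                              % per-feature maxima
Xs = 2*(X - repmat(mn, size(X,1), 1)) ./ repmat(mx - mn, size(X,1), 1) - 1;
% the test data must be scaled with the SAME mn/mx taken from the training set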

9.4 Matlab Implementation and Examples

This chapter is intended to show some examples and to give an impression of how the different tuneable values, such as the penalty C, the kernel parameters and the choice of maximal margin, 1-norm or 2-norm, affect the resulting classifier. The implementation in Matlab with the original SMO algorithm can be found here, together with the training sets (these files were used for making the following pictures possible):

Matlab Files\SVM\ (a complete list with the usage and a short description of each file is given in chapter 10)

It should be mentioned that the SMO implementation in Matlab is rather slow. Therefore nearly every toolbox for SVMs available and written in Matlab implements the SMO algorithm in C code and calls it from Matlab through the so-called MEX functions (the C/Matlab interface). But for examining the small examples used here, the use of pure Matlab is acceptable. Later the whole code for Support Vector Machines will be implemented in C++ anyway, to be integrated into the Neural Network Tool already existing at Siemens VDO.

For visualization purposes the dimension of the training and test vectors is restricted to the two-dimensional case, because only such examples can be visualized two- and three-dimensionally and discussed. The three-dimensional pictures will show on the z-axis the values calculated by the learned decision function, without applying the classification by the signum function sgn to it. The boundary will be shaded too, according to the functional margin of that point. Or in other words: the darker the shading, the more the point belongs to that specific class. The pictures will give clarification on this.
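What these plots show on the z-axis is the raw decision value before thresholding; a minimal MATLAB sketch of its computation (assuming a kernel given as a function handle k(x, z); this is an illustration, not the thesis' kernel_eval interface):

% Raw SVM decision value f(x) = sum_i alpha_i*y_i*k(x_i,x) + b, before sgn
% (the SMO part of this chapter subtracts b instead of adding it; the sign
% convention has to match the one used during training)
f = b;
for i = find(alphas > 0)'            % only support vectors contribute
    f = f + alphas(i)*y(i)*k(X(i,:), x);
end
label = sign(f);                     % the actual classification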

9.4.1 Linear Kernel

For examples using the linear kernel, the linearly separable cases of the binary functions OR and AND are considered (figures 9.5 and 9.6). The dashed lines represent the margin. The size of the functional margin is indicated by the level of shading. A test of the same machine on the XOR case results in a classification with one error, because of the nature of the XOR function of not being separable in input space (figure 9.7).

Figure 9.5: A linear kernel with maximal margin (C = inf) applied to the linearly separable case of the binary OR function.

Figure 9.6: A linear kernel with maximal margin (C = inf) applied to the linearly separable case of the binary AND function.

Figure 9.7: A linear kernel with soft margin (C = 1000) applied to the linearly non-separable case of the XOR function. The error is 25 %, as one of the four points is misclassified.

9.4.2 Polynomial Kernel

As seen before, the XOR case is not separable in input space. Therefore the usage of a kernel, mapping the data to a higher-dimensional space and separating it there linearly, could produce a classifier in input space separating the data correctly. To test this, a polynomial kernel of degree two with maximal margin (C = inf) is used. The result can be seen in figure 9.8. To get an impression of how this data becomes separable by mapping it to a higher-dimensional space, the three-dimensional picture in figure 9.9 visualizes on the z-axis the output of the classification step before applying the signum (sgn) function to it.
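Before turning to the figures, one can see concretely why degree two suffices for XOR by writing out an explicit feature map for the homogeneous quadratic kernel $(x \cdot z)^2$; a small MATLAB sketch (using the ±1 encoding of the XOR inputs; the thesis' polynomial kernel may additionally include an added constant c):

% Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), whose dot
% products reproduce the degree-2 kernel: phi(x).phi(z) = (x.z)^2
X = [1 1; -1 -1; 1 -1; -1 1];        % XOR inputs; labels y = [+1 +1 -1 -1]
phi = [X(:,1).^2, sqrt(2)*X(:,1).*X(:,2), X(:,2).^2];
% the middle coordinate is +sqrt(2) for class +1 and -sqrt(2) for class -1,
% so a hyperplane through the origin separates the mapped data linearly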

Figure 9.8: A polynomial kernel of degree 2 with maximal margin (C = inf) applied to the XOR dataset.

Figure 9.9: The classifier of figure 9.8, visualized by showing on the z-axis the value calculated by the classification before the application of the signum (sgn) function. Here one can see that the yellow regions belonging to one of the classes have greater positive values, and the green region belonging to the other class has values lower than zero. The change of separation from one class to the other is at the zero level of the classifier output (z-axis), as the signum function changes sign there.

The main conclusion to draw from the pictures up to now and from further ones is that the application of a kernel measures the similarity between the data in some way. Looking at the last two figures again, one can see that the points belonging to the same class are mapped in the same direction (output values > 0 or < 0). The upcoming pictures on the Gaussian kernel will stress this fact.

9.4.3 Gaussian Kernel (RBF)

As stated in the chapter on kernels, if one has no idea of how the data is interrelated, the Gaussian kernel, or in other words the radial basis function, is a good first choice. Surely, applying this kernel to the XOR case is overkill, but the pictures resulting from doing so anyway stress the fact that a kernel measures the similarity of data in some way (the resulting value before applying the signum function). Another fact is

that here the result of changing the sigma value (variance, window width; see chapter 6) can be seen quite clearly.

Figure 9.10: The RBF kernel applied to the XOR data set with σ = 0.1 and maximal margin (C = inf).

To see how the change of the sigma value (variance) affects the resulting classifier, compare figures 9.10 and 9.11 to figures 9.12 and 9.13. Notice the smoother and wider course of the curves at the given training points.
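The role of σ can also be read off the kernel directly; a small numeric sketch (assuming the usual form $k(x,z) = \exp(-\|x-z\|^2/(2\sigma^2))$; the normalization used in chapter 6 may differ by a constant factor):

% Gaussian (RBF) kernel value for two points at distance 1, for two sigmas
d  = 1;                              % distance between x and z
k1 = exp(-d^2/(2*0.1^2));            % sigma = 0.1: ~2e-22, almost no overlap
k2 = exp(-d^2/(2*0.5^2));            % sigma = 0.5: ~0.14, much wider influence
% small sigma -> narrow bumps around the training points (figure 9.11),
% larger sigma -> smoother and wider curves (figure 9.13)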

Figure 9.11: The classifier of figure 9.10 (σ = 0.1), visualized by showing on the z-axis the value calculated by the classification before the application of the signum (sgn) function. Remarkable are the Gauss curves at the positions of the four given training points (there the classifier is more confident that a point in that region belongs to the specific class).

Figure 9.12: The RBF kernel applied to the XOR data set with σ = 0.5 and maximal margin (C = inf).

Figure 9.13: The classifier of figure 9.12 (σ = 0.5), visualized by showing on the z-axis the value calculated by the classification before the application of the signum (sgn) function. Remarkable are again the Gauss curves at the positions of the four given training points. But in contrast to figure 9.11 with a sigma value of σ = 0.1, they are much smoother and wider, as sigma changes the width (consider the effect of the variance in a Gaussian distribution).

To get an impression of how different values of the penalty parameter C (soft margin case for 0 < C < inf) affect the resulting classifier, the next pictures illustrate this application of C. As a starting point, assume the classification problem of figure 9.14, classified by an SVM with a Gaussian kernel using σ = 0.1 and the maximal margin concept, allowing no training errors. The resulting classification regions are not very smooth, due to the two training points lying in the midst of the other class. Applying the same machine to the dataset, but with the soft margin approach by imposing the upper bound C = 5, results in the classifier of figure 9.15. Here the whole decision boundary is much smoother than in the maximal margin case. The main advantage is the broader margin, implying a better generalization. This fact is also stressed in figure 9.16 and the next subchapter.

Figure 9.14: A Gaussian kernel with σ = 0.1 and maximal margin (C = inf). The dashed margins are not really wide, because of the two points lying in the midst of the other class and the application of the maximal margin classifier (no errors allowed).

Figure 9.15: A Gaussian kernel with σ = 0.1 and soft margin (C = 5). This approach gives smoother decision boundaries in contrast to the classifier in figure 9.14, but at the expense of misclassifying two points now.

9.4.4 The Impact of the Penalty Parameter C on the Resulting Classifier and the Margin

Now the change of the resulting classifier (boundary, margins) when applying the maximal margin and the soft margin approach will be analyzed in detail. Assume the training set used in figure 9.16. The SVM used there is based on a Gaussian kernel applying the concept of the maximal margin approach, allowing no training error (C = inf). As one can see, the resulting classifier does not have a very broad margin. Therefore, as stated in the theory on generalization in part one of this diploma thesis, this classifier is assumed not to generalize very well. In contrast to this, the approaches in figures 9.17 to 9.19 use the soft margin optimization and result in a broader margin, but at the expense of allowing training errors. Such errors can also be interpreted as the classifier not overestimating the influence of some outliers in the training set (because of such outliers the hill in figure 9.16 lies in the midst of where one can imagine the other class should be).

Figure 9.16: An SVM with a Gaussian kernel with σ = 0.8 and maximal margin (C = inf). The resulting classifier is compatible with the training set without error, but has no broad margin.

So these classifiers are assumed to generalize better in this case, which is the goal of a classifier: it must generalize very well while minimizing the classification error. As stated in chapter two, another very general estimation of the generalization error of SVMs is based on the number of support vectors gained after training:

$$\frac{\#SV}{\ell}$$

So small numbers of support vectors are expected to give better generalization. Another advantage in practice is that the fewer support vectors there are, the less expensive is the computation of the classification of a point. So to summarize, as the theory on generalization stated, a broad margin and few support vectors are indications of good generalization. The application of the soft margin approach can therefore be seen as a compromise between minor empirical risk and minor optimism.
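Both indicators are cheap to read off a trained machine; a MATLAB sketch (alphas as returned by the training routines of chapter 10, tol a small threshold):

% Rough generalization indicators after training
nsv   = sum(alphas > tol);           % number of support vectors
ell   = length(alphas);              % number of training examples
bound = nsv/ell;                     % crude estimate: expected error <= #SV/ell
% fewer support vectors -> cheaper classification of a point and, as a rule
% of thumb, better generalization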

Figure 9.17: An SVM with a Gaussian kernel with σ = 0.8 and soft margin (C = 100). Notice the broader margin in contrast to figure 9.16. The boundary has become smoother, and the three misclassified points (four, counting one margin error; the others are real errors) do not have as much impact on the boundary as in figure 9.16.

Figure 9.18: An SVM with a Gaussian kernel with σ = 0.8 and soft margin (C = 10). Notice the broader margin in contrast to figures 9.16 and 9.17. The boundary is much smoother.

Figure 9.19: An SVM with a Gaussian kernel with σ = 0.8 and soft margin (C = 1). Notice the broader margin in contrast to figures 9.16, 9.17 and 9.18, and the much smoother boundary.

Part IV

Manuals, Available Toolboxes and Summary

Chapter 10

Manual

As said at the beginning, one of the goals was to implement the theory in a computer program for practical usage. This program was first developed in Matlab for easier debugging and for producing the demanding graphical output. All figures from the last chapter were done with this implementation, and after reading this chapter you should also be able to use the files created. After testing the whole theory there extensively, the code was ported to C++ as a module to be integrated into the already existing Neural Network Tool.

10.1 Matlab Implementation

First the Matlab approach was used because of the better debugging possibilities for the algorithm. The development was also faster here because of the mathematical nature of the problem. But the main advantage was the graphical output readily possible with Matlab.

Figure 10.1: The disk structure for all files associated with the Matlab implementation

The next table summarizes all files created for the Matlab implementation. An example of their usage is given after it.

Classifier/
    kernel_func: internally used for the kernel calculation.
    kernel_eval: evaluates the chosen kernel function for the given data.
Classifier/Binary Case/ (files associated with the 2-class problem)
    SMO: implementation of the original SMO algorithm. Multiple return values.
    SMO_Keerthi: implementation of the improved SMO algorithm by Keerthi with modification 2. Multiple return values.
    classify_point: classification of an unlabeled data example after training. Returns the value without the applied signum function (sgn).
Classifier/Multiclass/
    Multiclass_SMO: the above SMO for the multiclass case. Multiple return values.
    Multiclass_SMO_Keerthi: the above improved SMO for the multiclass case. Multiple return values.
    Multi_Classify_Point: classifies a point after training. A vector containing all votes for each class is returned; note tie situations!
Testdata/
    Contains *.mat files with prelabeled test data for loading into the workspace.
Util/
    checkddata: internally used by createdata.
    createdata: creates 2-dimensional prelabeled training data for the two- and multiclass case in a GUI; saveable to a file. Up to now only the following calling conventions are supported: createdata for the two-class case, and createdata('finite', nrofclasses) for creating multiclass test data.
    linscale: scales the data linearly to the interval [-1, +1].
    maketwoClass: if the data is not stored with the labels -1 and +1 for binary classification, this function rewrites them. Calling convention: maketwoClass(data, label_of_one_class); label_of_one_class is then mapped to +1 and the remaining ones to -1.
Visual/Binary case/ (files for plotting a trained two-class classifier)
    svcplot2d: two-dimensional plot of the trained classifier; only applicable if the data/feature vectors are 2-dimensional! The coloured shaded regions represent the calculated value of the classification for that point BEFORE applying the signum (sgn) function: yellow for values > 0 and therefore class +1, green for values < 0 and therefore class -1; the darker the colour, the greater the value (see the legend). The dashed lines represent the margin.
    svcplot3d: same as above, but as a three-dimensional plot visualizing the calculated value of the classification BEFORE applying the signum (sgn) function in the third dimension. The dashed lines represent the margin.
Visual/Multiclass/
    svcplot2d_multiclass: same as for the two-class classification above, but for problems of three up to a maximum of seven classes.

Table 10.1: List of files used in the Matlab implementation and their intention

10.2 Matlab Examples

Now two examples of how to use the Matlab implementation in practice follow: the first one is for the two-class problem, the other one shows how to train a multiclass classifier.

First call createdata or load a predefined test data set into the workspace. If using the createdata function, the screen looks like figure 10.2 after generating some points by left-clicking with the mouse. Points can be erased by right-clicking on them, and the range of the axes can be adjusted by entering the wanted values on the right. The class can be switched with the combo box in the upper right corner. When ready, click Save and choose a filename and location for saving the newly generated data. Close the window and load the file into the Matlab workspace. You should then see a vector X containing the feature data and a vector y containing the labels +1 and -1.

Figure 10.2: The GUI of the createdata function after the creation of some points for two-class classification by left-clicking on the screen.

Before training you have to specify the kernel to use. In this implementation that is done by creating a struct with the following fields:

mykernel.name = text
optional: mykernel.param1 = value_1
optional: mykernel.param2 = value_2

Values for text (the kernel used): 'linear', 'poly', 'rbf'.
value_1: not used for the linear kernel; for the polynomial one it is its dimension (degree), and for the RBF/Gaussian kernel it is the value of sigma (window width).
value_2: only used for the polynomial kernel, where it is the constant c added.

If none of the last two parameters is given, default values are used. There should then be a new variable in the workspace called mykernel.

In this example we use:

mykernel.name = 'poly'
mykernel.param1 = 2

Now we are ready for training. For this there are two functions available with the same calling convention:

SMO
SMO_Keerthi

As the names imply, the first one implements the original SMO algorithm and the second one the improved algorithm by Keerthi with modification 2. In any case the second one should always be used because, as stated in the former part, the original SMO is very slow and could run infinitely if you choose to separate the data by means of the hard margin although it is not separable without errors. To train the classifier simply call:

[alphas bias nsv trainerror] = SMO_Keerthi(X, y, upper_bound_C, eps, tol, mykernel)

X is the training set
y are the labels (+1, -1)
upper_bound_C is either inf for the hard margin case or any value > 0 for the soft margin one (here: inf)
eps is the accuracy, normally set to 0.001
tol is the tolerance for checking the KKT conditions, normally 0.001
mykernel is the struct created above

Returned values are:

alphas is the array containing the calculated Lagrange multipliers
bias is the calculated bias
nsv is the number of support vectors (alpha > 0)
trainerror is the error rate in % on the training set

If using the original function SMO(...), there is a need for another parameter after the mykernel variable: a norm flag, which is zero for using the hard margin or the 1-norm soft margin, and one for using the 2-norm. After pressing Return the training process starts, and you get an overview as in figure 10.3 after the training has ended. If the above calling convention is used, you get two newly created variables in the workspace for further usage: alphas and bias. Now the result can be visualized by using the functions svcplot2d and/or svcplot3d, as can be seen in figures 10.4 and 10.5. They have the same calling convention:

130 svcpotd(x,, mkerne, aphas, bas) Where X s agan the tranng data as before and aso the abes and mkerne. Aphas and bas are the varabes ganed through the tranng process. Fgure 0.3: After tranng ou get the resuts: vaues of aphas, the bas, the tranng error and the number of support vectors (nsv) 9

Figure 10.4: svcplot2d after the training of a polynomial kernel with degree two on the training set created as in figure 10.2.

Figure 10.5: svcplot3d after the training of a polynomial kernel with degree two on the training set created in figure 10.2.

The second example consists of four classes, to show how multiclass classification works here. Again create a training set with createdata, but now by calling (see figure 10.6):

createdata('finite', 4)

Figure 10.6: GUI for creating a four-class problem

In this example we use a linear kernel:

mykernel.name = 'linear'

After again loading the created data into the workspace, getting the variables X and y, we are ready for training:

[alphas bias nsv trainerror overall_error] = Multiclass_SMO_Keerthi(X, y, upper_bound_C, eps, tol, mykernel)

The only difference here to the binary case is the additional return value of the overall error, because trainerror is the error rate of each single classifier trained during the process (there are multiple ones, because the OVO method is used). After the training you again get the results as in figure 10.3. Now the trained classifier(s) can be plotted:

svcplot2d_multiclass(X, mykernel, alpha, bias, nr_of_classes)

where X is the training data, nr_of_classes is the number of classes used in training (here: 4) and the other parameters are the same as in the binary case above. This plot takes a little more time to show up, but in the end it looks like the one in figure 10.7.

Figure 10.7: svcplot2d_multiclass called after the training of the four classes as created in figure 10.6.

10.3 The C++ Implementation for the Neural Network Tool

The main goal of this work was the integration of the SVM module into the already existing Neural Network Tool created by Siemens VDO. The application GUI is shown in figure 10.8 with a test of the SVM. The tool already consisted of two integrated classifiers, the polynomial one and the Radial Basis Function classifier, and was capable of:

- training multiple instances of a classifier on separate or the same training set(s)
- visualizing data of two dimensions and the trained classifier
- storing the results and parameters of a classifier, for loading an already trained classifier and testing it on another data set

Figure 10.8: The Neural Network Tool with the integrated SVM. Here an overlapping training set was trained with the Gaussian/RBF kernel with no error.

The integration of the new module was easy because the system was already open for the further integration of new classification techniques. So the to-dos were the following:

- programming of a control GUI for the SVM
- programming of the algorithms themselves
- store and load procedures for the relevant parameters, to load a trained classifier at a later time
- store procedures for the results of a training run

As the algorithms used had been tested extensively in Matlab, they did not need any further debugging here. As a benefit of the time saved thereby, some additions were made that are not implemented in Matlab. For example, one is now able to do a grid search for the upper bound C as described

in chapter 7.2, but without cross-validation. The algorithms implemented here are the original SMO with 1- and 2-norm capabilities, and the improved SMO by Keerthi with modification 2. The program was split into a few modules, which can be seen in figure 10.9.

Figure 10.9: The UML diagram for the integration of the SVM module

In figure 10.10 the main control dialog for configuring everything relevant for SVM training can be seen. At the top you see the actual file loaded for training or testing. Below it, on the left, you can select the kernel to use (without further knowledge one should start with the Gaussian/RBF one, the default) and the algorithm of choice. Keerthi should always be selected because of its big advantages described in the chapters beforehand. On the right-hand side all other important variables are accessible, such as the upper bound C (a checkbox for the hard margin case; if deselected you can enter an upper bound > 0 by hand), the kernel parameters (polynomial degree/constant or sigma for the Gaussian/RBF kernel), the accuracy of the calculation and the tolerance for checking the KT conditions (default values here are 0.001).

Figure 10.10: Main control interface for configuring the important parameters for SVM training

Remember that if you select the SMO 2-norm as algorithm, no hard margin classification is possible, and therefore it is not selectable then. The input for the polynomial degree and sigma is shared in one edit box, indicated by the text next to it, which is switched appropriately. In the lower half you can check the box next to Upper Bound C for doing a grid search over predefined values of C. This simply trains classifiers with different values for the upper bound C (currently these are a fixed set of exponentially growing values, plus infinity for the hard margin case; such exponentially growing values were recommended

by Lin, as seen in chapter 7) and shows the results in a dialog after the training (see figure 10.11). One can then select the best parameters for training the classifier.

Figure 10.11: The results of a grid search for the upper bound C. From left to right it displays: the number of support vectors (NSV), the kernel parameters (unused yet), the used upper bound C and the training error in %. So one can easily see the general development of the training process for different values of C. Remarkable here is the fast decrease of the NSV with increasing C.

As stated in chapter 9, most of the time the fewer support vectors there are, the better the generalization will be. All in all this search helps a lot in finding the optimal value for the upper bound. With the later implementation of the grid search for the kernel parameters this will be a powerful tool to find the best suited parameters for the problem at hand. Last but not least, with the Stop Learning button one can interrupt the training process, and at the bottom of the dialog there is a progress bar giving visual feedback of the learning or testing progress.

As the Neural Network Tool is property of Siemens VDO, I am not able to include source files here or on the CD, but all Matlab files are included for testing purposes.
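The grid search itself can be reproduced with the Matlab files of this chapter; a sketch (the exponentially growing grid below is an assumed example in the spirit of Lin's recommendation, not the exact list used in the tool):

% Simple grid search over the upper bound C (no cross-validation),
% using the training routine from section 10.2
Cs = [2.^(-5:2:15) inf];             % assumed grid; inf = hard margin case
for j = 1:length(Cs)
    [alphas bias nsv trainerror] = SMO_Keerthi(X, y, Cs(j), 0.001, 0.001, mykernel);
    fprintf('C = %-8g  NSV = %-4d  training error = %g %%\n', Cs(j), nsv, trainerror);
end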


More information

Physics 2A Chapter 3 HW Solutions

Physics 2A Chapter 3 HW Solutions Phscs A Chapter 3 HW Solutons Chapter 3 Conceptual Queston: 4, 6, 8, Problems: 5,, 8, 7, 3, 44, 46, 69, 70, 73 Q3.4. Reason: (a) C = A+ B onl A and B are n the same drecton. Sze does not matter. (b) C

More information

Multispectral Remote Sensing Image Classification Algorithm Based on Rough Set Theory

Multispectral Remote Sensing Image Classification Algorithm Based on Rough Set Theory Proceedngs of the 2009 IEEE Internatona Conference on Systems Man and Cybernetcs San Antono TX USA - October 2009 Mutspectra Remote Sensng Image Cassfcaton Agorthm Based on Rough Set Theory Yng Wang Xaoyun

More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012 Support Vector Machnes Je Tang Knowledge Engneerng Group Department of Computer Scence and Technology Tsnghua Unversty 2012 1 Outlne What s a Support Vector Machne? Solvng SVMs Kernel Trcks 2 What s a

More information

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity Week3, Chapter 4 Moton n Two Dmensons Lecture Quz A partcle confned to moton along the x axs moves wth constant acceleraton from x =.0 m to x = 8.0 m durng a 1-s tme nterval. The velocty of the partcle

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

3. Stress-strain relationships of a composite layer

3. Stress-strain relationships of a composite layer OM PO I O U P U N I V I Y O F W N ompostes ourse 8-9 Unversty of wente ng. &ech... tress-stran reatonshps of a composte ayer - Laurent Warnet & emo Aerman.. tress-stran reatonshps of a composte ayer Introducton

More information

SVMs: Duality and Kernel Trick. SVMs as quadratic programs

SVMs: Duality and Kernel Trick. SVMs as quadratic programs /8/9 SVMs: Dualt and Kernel rck Machne Learnng - 6 Geoff Gordon MroslavDudík [[[partl ased on sldes of Zv-Bar Joseph] http://.cs.cmu.edu/~ggordon/6/ Novemer 8 9 SVMs as quadratc programs o optmzaton prolems:

More information

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Proceedngs of th European Symposum on Artfca Neura Networks, pp. 25-222, ESANN 2003, Bruges, Begum, 2003 On the Equaty of Kerne AdaTron and Sequenta Mnma Optmzaton n Cassfcaton and Regresson Tasks and

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Strain Energy in Linear Elastic Solids

Strain Energy in Linear Elastic Solids Duke Unverst Department of Cv and Envronmenta Engneerng CEE 41L. Matr Structura Anass Fa, Henr P. Gavn Stran Energ n Lnear Eastc Sods Consder a force, F, apped gradua to a structure. Let D be the resutng

More information

Solutions to Homework 7, Mathematics 1. 1 x. (arccos x) (arccos x) 1

Solutions to Homework 7, Mathematics 1. 1 x. (arccos x) (arccos x) 1 Solutons to Homework 7, Mathematcs 1 Problem 1: a Prove that arccos 1 1 for 1, 1. b* Startng from the defnton of the dervatve, prove that arccos + 1, arccos 1. Hnt: For arccos arccos π + 1, the defnton

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information