Part II. Support Vector Machines



Chapter 5 Linear Classification

5.1 Linear Classifiers on Linearly Separable Data

As a first step in understanding and constructing Support Vector Machines we study the case of linearly separable data which is simply classified into two classes, the positive and the negative one, also known as binary classification. To give a link to an example important nowadays, imagine the classification problem of email into spam or not-spam. A calculated example and examples on linearly non-separable data can be found in Appendix B. This classification is frequently performed by using a real-valued function f : X ⊆ R^n → R in the following way: the input x = (x_1, ..., x_n) is assigned to the positive class if f(x) ≥ 0, and otherwise to the negative one. The vector x is built up by the relevant features which are used for classification. In our spam example above we need to extract relevant features (certain words) from the text and build a feature vector for the corresponding document. Often such feature vectors consist of the counted numbers of predefined words, as in figure 5.1. If you would like to learn more about text classification / categorization you can have a look at [Joa98], where the feature vectors have dimensions in the range of about 9000. In this diploma thesis we assume that the features are already available. We consider the case where f is a linear function of x ∈ X, so it can be written as

    f(x) = ⟨w·x⟩ + b    (5.1)

where w ∈ R^n and b ∈ R are the parameters.
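To make the thresholding concrete, here is a minimal sketch of such a linear decision rule (not from the thesis), assuming NumPy is available; the weight and bias values are purely hypothetical:

```python
import numpy as np

def predict(x, w, b):
    """Linear decision function f(x) = <w, x> + b, thresholded at zero.
    Returns +1 for the positive class, -1 for the negative class."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical spam example: features are counts of two trigger words.
w = np.array([1.0, 2.0])   # illustrative weight vector
b = -3.0                   # illustrative bias
print(predict(np.array([4, 1]), w, b))   # +1 -> "spam"
print(predict(np.array([0, 1]), w, b))   # -1 -> "not spam"
```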

Figure 5.1: Vector representation of the sentence "Take Viagra before watching a video, or leave Viagra be to play in our online casino."

These parameters are often referred to as weight vector w and bias b, terms borrowed from the neural network literature. As stated in Part I, the goal is to learn these parameters from the given and already classified data (done by the supervisor/teacher), the training set. This way of learning is therefore called supervised learning. So the decision function for classification of an input x = (x_1, ..., x_n) is given by sign(f(x)):

    sign(f(x)) = +1 if f(x) ≥ 0 (positive class), −1 else (negative class)

Geometrically we can interpret this behaviour as follows (see figure 5.2): one can see that the input space X is split into two parts by the so-called hyperplane, defined by the equation ⟨w·x⟩ + b = 0. This means every input vector solving this equation is directly part of the hyperplane. A hyperplane is an affine subspace⁷ of dimension n−1 which divides the space into two half spaces which correspond to the inputs of the two distinct classes.

⁷ A translation of a linear subspace of R^n is called an affine subspace. For example, any line or plane in R³ is an affine subspace.

In the example of figure 5.2, n = 2 is a two dimensional input space, so the hyperplane is simply a line here. The vector w therefore defines a direction perpendicular to the hyperplane, so the direction of the hyperplane is unique, while varying the value of b moves the hyperplane parallel to itself. Whereby negative values of b move the hyperplane running through the origin into the positive direction.

Figure 5.2: A separating hyperplane (w, b) for a two dimensional training set. The smaller dotted lines represent the class of hyperplanes with the same w and different values of b.

In fact it is clear to see that if one wants to represent all possible hyperplanes in the space R^n, the representation is only possible by involving n + 1 free parameters: n ones given by w and one by b. But the question that arises here is which hyperplane to choose, because there are many possible ways in which it can separate the data. So we need a criterion for choosing the best one, the optimal separating hyperplane. The goal behind supervised learning from examples for classification can be restricted to consideration of the two-class problem without loss of generality. In this problem the goal is to separate the two classes by a function which is induced from available examples. The overall goal is to produce a classifier by finding parameters w and b that work well on unseen examples, i.e. that generalizes well.

So if the distance between the separating hyperplane and the training points becomes too small, even test examples near to the given training points would be misclassified. Figure 5.3 illustrates this behaviour. Therefore it seems that the classification of unseen data is much more successful in setting B than in setting A. This observation leads to the concept of the maximal margin hyperplanes, or the optimal separating hyperplane.

Figure 5.3: Which separation to choose? Almost zero margin (A) or large margin (B)?

In appendix B.1 we have a closer look at an example with a simple iterative algorithm separating points from two classes by means of a hyperplane, the so-called Perceptron. It is only applicable on linearly separable data. There we also find some important issues, also stressed in the following chapters, which have a large impact on the algorithms used in the Support Vector Machines.

5.2 The Optimal Separating Hyperplane for Linearly Separable Data

Definition 5.1 (Margin) Consider the separating hyperplane H defined by ⟨w·x⟩ + b = 0, with both w and b normalised by ‖w‖: w → w/‖w‖ and b → b/‖w‖.

The functional margin γ_i(w, b) of an example x_i with respect to H is defined as the distance between x_i and H:

    γ_i(w, b) = y_i(⟨w·x_i⟩ + b)

The margin γ_S(w, b) of a set of vectors A = {x_1, ..., x_n} is defined as the minimum distance from H to the vectors in A:

    γ_S(w, b) = min_{x_i ∈ A} γ_i(w, b)

For clarification see figures 5.4 and 5.5.

Figure 5.4: The functional margin of two points with respect to a hyperplane

In figure 5.5 we have introduced two new identifiers, d₊ and d₋: let them be the shortest distance from the separating hyperplane H to the closest positive (negative) example, i.e. the smallest functional margin from each class. Then the geometric margin is defined as d₊ + d₋.

Figure 5.5: The geometric margin of a training set

The training set is therefore said to be optimally separated by the hyperplane if it is separated without any error and the distance between the closest vectors and the hyperplane is maximal (maximal margin) [Vap98]. So the goal is to maximize the margin.

As Vapnik showed in his work [Vap98], we can assume canonical hyperplanes in the upcoming discussion without loss of generality. This is necessary because there exists the following problem: for any scaling parameter c ≠ 0,

    ⟨w·x⟩ + b = 0  ⟺  ⟨cw·x⟩ + cb = 0

E.g. a possible solution is w = (1, 1), b = 1. With a parameter c of value 5 we get

w = (5, 5), b = 5, which can also be solved by the same x. So (cw, cb) describe the same hyperplane as (w, b) do. This means the hyperplane is not described uniquely! For uniqueness, (w, b) always need to be scaled by a factor c relative to the training set. The following constraint is chosen to do this:

    min_i |⟨w·x_i⟩ + b| = 1

This constraint scales the hyperplane in a way such that the training points nearest to it get some important property: now they solve ⟨w·x_i⟩ + b = +1 for x_i of class y_i = +1, and on the other side ⟨w·x_i⟩ + b = −1 for x_i of class y_i = −1. A hyperplane scaled in such a way is called a canonical hyperplane. Reformulated this means, implying correct classification:

    y_i(⟨w·x_i⟩ + b) ≥ 1, i = 1, ..., ℓ    (5.2)

This can be transformed into the following constraints:

    ⟨w·x_i⟩ + b ≥ +1 for y_i = +1
    ⟨w·x_i⟩ + b ≤ −1 for y_i = −1    (5.3)

Therefore it is clear to see that the hyperplanes H1 and H2 in figure 5.5 are solving ⟨w·x⟩ + b = +1 and ⟨w·x⟩ + b = −1. They are called margin hyperplanes. Note that H1 and H2 are parallel (they have the same normal w, as H does too) and that no other training points fall between them, i.e. in the margin! They solve min_i |⟨w·x_i⟩ + b| = 1.

Definition 5.2 (Distance) The Euclidean distance d(w, b; x_i) of a point x_i belonging to a class y_i ∈ {−1, +1} from the hyperplane (w, b) that is defined by ⟨w·x⟩ + b = 0 is

    d(w, b; x_i) = y_i(⟨w·x_i⟩ + b) / ‖w‖    (5.4)

As stated above, training points x₊ and x₋ that are nearest to the so scaled hyperplane (respectively, they lie on H1 and H2) have the distances d₊ and d₋ from it (see figure 5.5). Reformulated with equation 5.4 and constraints 5.3 this means:

    ⟨w·x₊⟩ + b = +1 and ⟨w·x₋⟩ + b = −1
    d₊ = 1/‖w‖ and d₋ = 1/‖w‖

So overall, as seen in figure 5.5, the geometric margin of a separating canonical hyperplane is d₊ + d₋ = 2/‖w‖. As stated, the goal is to maximize this margin. That is achieved by minimising ‖w‖. The transformation to a quadratic function of the form Φ(w) = ½‖w‖² does not change the result but eases later calculation. This is because we now solve the problem with the help of the Lagrangian method. There are two reasons for doing so. First, the constraints of 5.2 will be replaced by constraints on the Lagrange multipliers themselves,

which will be much easier to handle (they are equalities then). Second, the training data will only appear in the form of dot products between vectors, which will be a crucial concept later in generalizing the method to the nonlinearly separable case and the use of kernels. And so the problem is reformulated into a convex one which is overall easier to handle by the Lagrangian method with its differentiations. Summarizing, we have the following optimization problem to solve: given a linearly separable training set S = ((x_1, y_1), ..., (x_ℓ, y_ℓ)),

    Minimize ½‖w‖²
    subject to ⟨w·x_i⟩ + b ≥ +1 for y_i = +1
               ⟨w·x_i⟩ + b ≤ −1 for y_i = −1    (5.5)

The constraints are necessary to ensure uniqueness of the hyperplane, as mentioned above!

Note: ½‖w‖² = ½⟨w·w⟩, because ‖w‖ = √(w_1² + ... + w_n²) and so ‖w‖² = w_1² + ... + w_n² = ⟨w·w⟩.

Also the optimization problem is independent of the bias b, because provided equation 5.2 is satisfied (i.e. it is a separating hyperplane), changing the value of b only moves it in the normal direction to itself. Accordingly the margin remains unchanged, but the hyperplane would no longer be optimal. The problem of 5.5 is known as a convex quadratic optimization⁸ problem with linear constraints and can be efficiently solved by using the method of the Lagrange multipliers and the duality theory (see chapter 4).

⁸ Convexity will be proved later (see chapter 5.5.2).

The primal Lagrangian for 5.5 and the given linearly separable training set S is

    L_P(w, b, α) = ½⟨w·w⟩ − Σ_{i=1}^{ℓ} α_i [y_i(⟨w·x_i⟩ + b) − 1]    (5.6)

where the α_i ≥ 0 are the Lagrange multipliers. This Lagrangian L_P has to be minimized with respect to the primal variables w and b. As seen in chapter 4, at the saddle point the two derivatives with respect to w and b must vanish (stationarity),

    ∂L_P/∂w = 0 and ∂L_P/∂b = 0,

obtaining the following relations:

    w = Σ_{i=1}^{ℓ} α_i y_i x_i  and  Σ_{i=1}^{ℓ} α_i y_i = 0    (5.7)

By substituting the relations 5.7 back into L_P one arrives at the so-called Wolfe dual of the optimization problem (now only dependent on α, no more on w and b!):

    L_D(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j ⟨x_i·x_j⟩    (5.8)

So the dual problem for 5.6 can be formulated: given a linearly separable training set S,

    Maximize W(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j ⟨x_i·x_j⟩
    subject to Σ_{i=1}^{ℓ} α_i y_i = 0, α_i ≥ 0, i = 1, ..., ℓ    (5.9)
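As a minimal sketch (not part of the thesis), the dual 5.9 can be handed to an off-the-shelf QP solver; this assumes the cvxopt package and only terminates for separable data, since the hard margin dual is otherwise unbounded:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def hard_margin_dual(X, y):
    """Solve the Wolfe dual (5.9) as a QP: minimize (1/2) a'Qa - e'a
    subject to alpha_i >= 0 and y'alpha = 0, with Q_ij = y_i y_j <x_i, x_j>.
    X is an (l, n) float array of points, y an (l,) array of +/-1 labels."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    l = X.shape[0]
    Yx = y[:, None] * X                    # each row x_i scaled by its label
    Q = Yx @ Yx.T                          # Q_ij = y_i y_j <x_i, x_j>
    P, q = matrix(Q), matrix(-np.ones(l))  # maximize W  <=>  minimize -W
    G, h = matrix(-np.eye(l)), matrix(np.zeros(l))   # -alpha_i <= 0
    A, b = matrix(y.reshape(1, l)), matrix(0.0)      # sum_i alpha_i y_i = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])              # the optimal multipliers alpha*
```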

Note: The matrix with entries ⟨x_i·x_j⟩ is known as the Gram matrix G. So the goal is to find parameters α_i which solve this optimization problem. As a solution to construct the optimal separating hyperplane with maximal margin we obtain the optimal weight vector:

    w* = Σ_{i=1}^{ℓ} α_i* y_i x_i    (5.10)

Remark: One can think that up to now the problem will be able to be solved easily, as the one in appendix C, with the use of Lagrangian theory and the primal & dual objective functions. This could be right when having input vectors of small dimension, e.g. 2. But in the real-world case the number of variables will be over some thousand. Here solving the system with standard techniques will not be practicable in terms of time and memory usage of the corresponding vectors and matrices. But this issue will be discussed in the implementation chapter later.

5.2.1 Support Vectors

Stating the Kuhn-Tucker (KT) conditions for the primal problem L_P above (5.6), as seen in chapter 4, we get:

    ∂L_P/∂w = w − Σ α_i y_i x_i = 0
    ∂L_P/∂b = Σ α_i y_i = 0
    y_i(⟨w·x_i⟩ + b) − 1 ≥ 0, i = 1, ..., ℓ
    α_i ≥ 0, i = 1, ..., ℓ
    α_i [y_i(⟨w·x_i⟩ + b) − 1] = 0, i = 1, ..., ℓ    (5.11)

As mentioned, the optimization problem for SVMs is a convex one (a convex function with constraints giving a convex feasible region). And for convex problems the KT conditions are necessary and sufficient for w, b and α to be a solution. Thus solving the primal/dual problem of the SVMs

is equivalent to finding a solution to the KT conditions for the primal⁹ (see chapter 3 too). The fifth relation in 5.11 is known as the KT complementarity condition. In the third chapter on optimization theory an intuition was given on how it works. In the SVM's problem it has a good graphical meaning. It states that for a given training point x_i either the corresponding Lagrange multiplier equals zero, or, if not zero, x_i lies on one of the margin hyperplanes (see figure 5.4 and following text) H1 or H2:

    H1: ⟨w·x_i⟩ + b = +1
    H2: ⟨w·x_i⟩ + b = −1

On them are the training points with minimal distance to the optimal separating hyperplane (OSH) with maximal margin. The vectors lying on H1 or H2 implying α_i > 0 are called Support Vectors (SV).

Definition 5.3 (Support Vectors) A training point x_i is called support vector if its corresponding Lagrange multiplier α_i > 0.

All other training points have α_i = 0 and either lie on one of the two margin hyperplanes (equality in 5.2) or on the outer side of H1 or H2 (inequality in 5.2). A training point with α_i = 0 can be on one of the two margin hyperplanes because the complementarity condition in 5.11 only states that all SVs are on the margin hyperplanes, but not that the SVs are the only ones on them. So there may be the case where both α_i = 0 and y_i(⟨w·x_i⟩ + b) = 1. Then the point lies on one of the two margin hyperplanes without being a SV. Therefore SVs are the only points involved in determining the optimal weight vector in equation 5.10. So the crucial concept here is that the optimal separating hyperplane is uniquely defined by the SVs of a training set. That means repeating the training with all other points removed (or moved around without crossing H1 or H2) will lead to the same weight vector and therefore to the same optimal separating hyperplane.

⁹ Only they will be needed, because the primal/dual problem is an equivalent one: we maximize the dual (it is only dependent on α!) and as a criterion take the KT conditions of the primal.

In other words, a compression has taken place. So for repeating the training later, the same result can be achieved by only using the determined SVs.

Figure 5.6: The optimal separating hyperplane (OSH) with maximal margin is determined by the support vectors (SV, marked) lying on the margin hyperplanes H1 and H2.

Note that in the dual representation the value of b does not appear, and so the optimal value b* has to be found making use of the primal constraints:

    y_i(⟨w*·x_i⟩ + b) ≥ 1, i = 1, ..., ℓ

So only the optimal value of w* is explicitly determined by the training procedure. This implies we have optimal values for α. Therefore it is possible to pick any support vector and, after substitution into the above inequality, the constraint becomes an equality (because a support vector always is part of a margin hyperplane), from which b* can be computed. Numerically it is safer to compute b* for all α_i > 0 and take the mean value, or to take another approach as in the book [Ne]:

    b* = −½ (max_{y_i=−1} ⟨w*·x_i⟩ + min_{y_i=+1} ⟨w*·x_i⟩)    (5.12)

Note: This approach to compute the bias has been shown to be problematic with regard to the implementation of the SMO algorithm, as shown by [Ker01]. This issue will be discussed in the implementation chapter later.

5.2.2 Classification of unseen data

After the hyperplane's parameters w* and b* have been learned with the training set, we can classify unseen/unlabelled data points z. In the binary case (2 classes) discussed up to now, the found hyperplane divides R^n into two regions: one where ⟨w*·z⟩ + b* > 0 and the other one where ⟨w*·z⟩ + b* < 0. The idea behind the maximal margin classifier is to determine on which of the two sides the test pattern lies and to assign the corresponding label −1 or +1 (as all classifiers do), and also to maximize the margin between the two sets. Hence the used decision function can be expressed with the optimal parameters w* and b*, and therefore by the found/used support vectors, their corresponding α_i* > 0, and b*. So overall the decision function of the trained maximal margin classifier for some data point z can be formulated:

    f(z, w*, b*) = sign(⟨w*·z⟩ + b*) = sign(Σ_{i=1}^{ℓ} α_i* y_i ⟨x_i·z⟩ + b*) = sign(Σ_{i∈SV} α_i* y_i ⟨x_i·z⟩ + b*)    (5.13)

Whereby the last reformulation only sums over the elements (training point x_i, corresponding label y_i, associated α_i*) and the bias b* which are associated with a support vector (i ∈ SV), because only they have α_i* > 0 and therefore an impact on the sum. All in all, the optimal separating hyperplane we get by solving the margin optimization problem is a very simple special case of a Support Vector Machine, because it computes directly on the input data. But it is a good starting point for understanding the forthcoming concepts.
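Putting 5.10, the averaged bias, and 5.13 together, a minimal sketch (an assumption-laden illustration, not the thesis' implementation) could look as follows, given the multipliers from a solver like the one above:

```python
import numpy as np

def recover_hyperplane(X, y, alpha, tol=1e-8):
    """From the optimal multipliers, recover w* via (5.10) and b* by
    averaging y_i - <w*, x_i> over all support vectors (alpha_i > tol),
    since y_i(<w*, x_i> + b*) = 1 holds exactly on them."""
    sv = alpha > tol
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    b = float(np.mean(y[sv] - X[sv] @ w))
    return w, b, sv

def classify(z, w, b):
    """Decision function (5.13) for an unseen point z."""
    return np.sign(w @ z + b)
```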

In the next chapters the concept will be generalized to nonlinear classifiers, and therefore the concept of kernel mapping will be introduced. But first the adaption of the separating hyperplane to linearly non-separable data will be done.

5.3 The Optimal Separating Hyperplane for Linearly Non-Separable Data

The algorithm above for the maximal margin classifier cannot be used in many real-world applications. In general, noisy data renders linear separation impossible, but the hugest problem in practice will still be the used features, leading to overlapping classes. The main problem with the maximal margin classifier is the fact that it allows no classification errors during training. Either the training is perfect without any errors, or there is no solution at all. Hence it is intuitive that we need a way to relax the constraints of 5.3. But each violation of the constraints needs to be punished by a misclassification penalty, i.e. an increase in the primal objective function L_P. This can be realized by introducing the so-called positive slack variables ξ_i in the constraints first and, as shown later, introducing an error weight C too:

    ⟨w·x_i⟩ + b ≥ +1 − ξ_i for y_i = +1
    ⟨w·x_i⟩ + b ≤ −1 + ξ_i for y_i = −1
    ξ_i ≥ 0

As above, these two constraints can be rewritten into one:

    y_i(⟨w·x_i⟩ + b) ≥ 1 − ξ_i, i = 1, ..., ℓ    (5.14)

So the ξ_i's can be interpreted as a value that measures how much a point fails to have a margin distance of 1/‖w‖ to the OSH. It indicates where a point lies compared to the separating hyperplane (see figure 5.7):

    ξ_i > 1: misclassification of x_i
    0 < ξ_i ≤ 1: x_i is classified correctly but lies inside the margin

    ξ_i = 0: x_i is classified correctly and lies outside the margin or on the margin boundary

So a classification error is marked by the corresponding ξ_i exceeding unity. Therefore Σ_i ξ_i is an upper bound on the number of training errors. Overall, with the introduction of these slack variables, the goal is to maximize the margin and simultaneously minimize misclassifications. To define a penalty on training errors, the error weight C is introduced by the term C Σ_{i=1}^{ℓ} ξ_i. This parameter has to be chosen by the user. In practice C is varied through a wide range of values and the optimal performance is assessed using a separate validation set or a technique called cross-validation (for verifying performance just using the training set).

Figure 5.7: Values of slack variables: misclassification of x_i if ξ_i is larger than the margin (ξ_i > 1); correct classification of x_i lying in the margin with 0 < ξ_i ≤ 1; correct classification of x_i outside the margin or on it with ξ_i = 0

So the optimization problem can be extended to:

    Minimize ½‖w‖² + C Σ_{i=1}^{ℓ} ξ_i^k
    subject to y_i(⟨w·x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., ℓ    (5.15)

The problem is again a convex one for any positive integer k. This approach is called the Soft Margin generalization, while the original concept above is known as Hard Margin, because it allows no errors. The Soft Margin case is widely used with the values k = 1 (1-Norm Soft Margin) and k = 2 (2-Norm Soft Margin).

5.3.1 1-Norm Soft Margin - or the Box Constraint

For k = 1 as above, the primal Lagrangian can be formulated as

    L_P(w, b, ξ, α, β) = ½⟨w·w⟩ + C Σ ξ_i − Σ α_i [y_i(⟨w·x_i⟩ + b) − 1 + ξ_i] − Σ β_i ξ_i

with α_i ≥ 0, β_i ≥ 0.

Note: As described in chapter 4, we need another multiplier β here because of the new inequality constraint ξ_i ≥ 0. As before, the corresponding dual representation is found by differentiating L_P with respect to w, ξ and b:

    ∂L_P/∂w = w − Σ α_i y_i x_i = 0
    ∂L_P/∂b = Σ α_i y_i = 0
    ∂L_P/∂ξ_i = C − α_i − β_i = 0

By resubstituting these relations back into the primal we obtain the dual formulation L_D:

Given a training set S,

    Maximize L_D(w, b, ξ, α, β) = W(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j ⟨x_i·x_j⟩
    subject to Σ α_i y_i = 0, 0 ≤ α_i ≤ C    (5.16)

This problem is curiously identical to that of the maximal (hard) margin one in 5.9. The only difference is that C − α_i − β_i = 0 together with β_i ≥ 0 enforces α_i ≤ C. So in the soft margin case the Lagrange multipliers are upper bounded by C. The Kuhn-Tucker complementarity conditions for the primal above are:

    α_i [y_i(⟨w·x_i⟩ + b) − 1 + ξ_i] = 0, i = 1, ..., ℓ
    ξ_i (α_i − C) = 0, i = 1, ..., ℓ

Another consequence of the KT conditions is that non-zero slack variables ξ_i can only occur when β_i = 0 and therefore α_i = C. The corresponding point x_i has a distance of less than 1/‖w‖ from the hyperplane and therefore lies inside the margin. This can be seen from the constraints (only shown for the y_i = +1 case, the other case is analogous):

    ⟨w·x_i⟩ + b = +1 for points on the margin hyperplane
    ⟨w·x_i⟩ + b ≥ 1 − ξ_i ⟹ ⟨w·x_i⟩ + b < 1 for ξ_i > 0

And therefore points with non-zero slack variables have a distance of less than 1/‖w‖. Points for which 0 < α_i < C lie exactly at the target distance of 1/‖w‖ and therefore on one of the margin hyperplanes (ξ_i = 0). This also shows that the hard margin hyperplane can be attained in the soft margin case by setting C to infinity.
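In a QP-solver formulation like the cvxopt sketch above, the only change 5.16 requires is the upper bound on α; a minimal sketch of the stacked inequality block (again an illustration under the cvxopt assumption, not the thesis' code):

```python
import numpy as np
from cvxopt import matrix

def box_constraints(l, C):
    """Stack 0 <= alpha_i <= C from (5.16) as G*alpha <= h; this block
    replaces the plain alpha_i >= 0 block in the hard margin QP above."""
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), np.full(l, float(C))]))
    return G, h
```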

The fact that the Lagrange multipliers are upper bounded by the value of C gives the name to this technique: box constraint, because the vector α is constrained to lie inside the box with side length C in the positive orthant. This approach is also known as 'SVM with linear loss function'.

5.3.2 2-Norm Soft Margin - or Weighting the Diagonal

This is the case for k = 2. But before stating the primal Lagrangian, and for ease of the upcoming calculation, note that for ξ_i < 0 the first constraint of 5.15 still holds if ξ_i = 0. Hence we still obtain the optimal solution when the positivity constraint on ξ_i is removed. So this leads to the following primal Lagrangian:

    L_P(w, b, ξ, α) = ½⟨w·w⟩ + (C/2) Σ ξ_i² − Σ α_i [y_i(⟨w·x_i⟩ + b) − 1 + ξ_i]

with the Lagrange multipliers α_i ≥ 0 again. As before, the corresponding dual is found by differentiating with respect to w, ξ and b, imposing stationarity, i.e. setting to zero:

    ∂L_P/∂w = w − Σ α_i y_i x_i = 0
    ∂L_P/∂b = Σ α_i y_i = 0
    ∂L_P/∂ξ_i = C ξ_i − α_i = 0

and again resubstituting the relations back into the primal to obtain the dual formulation L_D:

    L_D = Σ α_i − ½ ΣΣ α_i α_j y_i y_j ⟨x_i·x_j⟩ − (1/2C) Σ α_i²

Using the equation Σ_i α_i² = Σ_i Σ_j α_i α_j δ_ij, where δ_ij is the Kronecker delta (defined to be 1 if i = j and 0 otherwise): inserting y_i y_j on the right side of the above equation changes nothing in the result, because together with δ_ij it is the same as writing y_i², which is always +1, so we simply multiply by an extra +1, but can simplify L_D to get the final problem to be solved. Given a training set S,

    Maximize L_D(w, b, ξ, α) = W(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j (⟨x_i·x_j⟩ + δ_ij/C)
    subject to Σ α_i y_i = 0, α_i ≥ 0, i = 1, ..., ℓ    (5.17)

The complementarity KT conditions for the primal problem above are

    α_i [y_i(⟨w·x_i⟩ + b) − 1 + ξ_i] = 0, i = 1, ..., ℓ

This whole problem can be solved with the same methods used for the maximal margin classifier. The only difference is the addition of 1/C to the diagonal of the Gram matrix G (only on the diagonal, because of the Kronecker delta). This approach is also known as 'SVM with quadratic loss function'. Summarizing this subchapter, it can be said that the soft margin optimization is a compromise between little empirical risk and maximal margin. For an example look at figure 5.8. The value of C can be interpreted as representing the trade-off between minimizing the training set error and maximizing the margin. So, all in all, by using C as an upper bound on the Lagrange multipliers, the role of outliers is reduced by preventing a point from having too large Lagrange multipliers.
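Since the 2-norm case only modifies the Gram matrix, the change amounts to a one-line helper; a sketch (assuming NumPy, with K the precomputed Gram matrix):

```python
import numpy as np

def weight_diagonal(K, C):
    """2-norm soft margin (5.17): reuse the hard margin machinery,
    only adding 1/C to the diagonal of the Gram matrix K."""
    return K + np.eye(K.shape[0]) / C
```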


Figure 5.8: Decision boundaries arising when using a Gaussian kernel with fixed value of σ in the three different machines: (a) the maximal margin SVM, (b) the 1-norm soft margin SVM and (c) the 2-norm soft margin SVM. The data is an artificially created two dimensional set, the blue dots being positive examples and the red ones negative examples.

5.4 The Duality of Linear Machines

This section is intended to stress a fact that was used and remarked several times before: the linear machines introduced above can be formulated in a dual description. This reformulation will turn out to be crucial in the construction of the more powerful generalized Support Vector Machines below. But what does duality of classifiers mean? As seen in the former chapter, the normal vector w can be represented as a linear combination of the training points:

    w = Σ_{i=1}^{ℓ} α_i y_i x_i

with S the given training set, already classified by the supervisor. The α_i were introduced in the used Lagrangian way to find a

solution to the margin maximization problem. They were called the dual variables of the problem and are therefore the fundamental unknowns. On the way to the solution we then obtain W(α) and the reformulated decision function for unseen data z of 5.13:

    f(z, α*, b*) = sign(⟨w*·z⟩ + b*) = sign(Σ α_i* y_i ⟨x_i·z⟩ + b*)

The crucial observation here is that the training and test points never act through their individual attributes. These points only appear as entries ⟨x_i·x_j⟩ in the Gram matrix G in the training phase, and later in the test phase they only appear in an inner product ⟨x_i·z⟩ with the training points.

5.5 Vector/Matrix Representation of the Optimization Problem and Summary

5.5.1 Vector/Matrix Representation

To give a first impression on how the above problems can be solved using a computer, the problems will be formulated in the equivalent notation with vectors and matrices. This notation is more practically understandable and is used in many implementations. As described above, the convex quadratic optimization problem which arises for the hard (C = ∞), 1-norm (C < ∞) and 2-norm (change the Gram matrix by means of adding 1/C to the diagonal) margin is the following:

    Maximize L_D(w, b, ξ, α, β) = W(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j ⟨x_i·x_j⟩
    subject to Σ α_i y_i = 0, 0 ≤ α_i ≤ C

This problem can be expressed as:

    Maximize e^T α − ½ α^T Q α
    subject to y^T α = 0, 0 ≤ α_i ≤ C, i = 1, ..., ℓ    (5.18)

where e is the vector of all ones, C > 0 the upper bound, and Q is an ℓ × ℓ positive semidefinite matrix with entries Q_ij = y_i y_j ⟨x_i·x_j⟩. With a correct training set S of length ℓ, 5.18 would look like:

    Maximize (1, ..., 1)(α_1, ..., α_ℓ)^T − ½ (α_1, ..., α_ℓ) ( Q_11 ... Q_1ℓ ; ... ; Q_ℓ1 ... Q_ℓℓ ) (α_1, ..., α_ℓ)^T

5.5.2 Summary

As seen in chapter 4, quadratic problems with a so-called positive semi-definite matrix are convex functions. This allows the crucial concepts of solutions to convex functions to be adapted (see chapter 4): convexity and the KT conditions. Semidefinite means: x^T Q x ≥ 0 for each x, i.e. Q has non-negative eigenvalues (also see the next page for an explanation).

In former chapters the convexity of the objective function has been assumed without proof. So let M be any (possibly non-square) matrix and set A = M^T M. Then A is a positive semi-definite matrix, since we can write

    x^T A x = x^T M^T M x = (M x)^T (M x) = ‖M x‖² ≥ 0    (5.19)

for any vector x. If we take M to be the matrix whose columns are the vectors x_i, then A is the Gram matrix of the set S, showing that Gram matrices are always positive semi-definite. And therefore the above matrix Q also is positive semi-definite. Summarized, the problem to be solved up to now can be stated as

    Maximize L_D(w, b, ξ, α, β) = W(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j ⟨x_i·x_j⟩
    subject to Σ α_i y_i = 0, 0 ≤ α_i ≤ C    (5.20)

with the particularly simple primal KT conditions as criterion for a solution to the 1-norm optimization problem:

    α_i = 0 ⟹ y_i(⟨w·x_i⟩ + b) ≥ 1
    0 < α_i < C ⟹ y_i(⟨w·x_i⟩ + b) = 1
    α_i = C ⟹ y_i(⟨w·x_i⟩ + b) ≤ 1    (5.21)

Notice that the slack variables ξ_i do not need to be computed for this case because, as seen in chapter 5.3.1, they will only be non-zero if α_i = C and β_i = 0. So recall the primal of this chapter, stated as

    L_P(w, b, ξ, α, β) = ½⟨w·w⟩ + C Σ ξ_i − Σ α_i [y_i(⟨w·x_i⟩ + b) − 1 + ξ_i] − Σ β_i ξ_i

Then set β_i = C − α_i = 0 for the points with non-zero slack, so the third sum is zero, and from the second

sum we get Σ α_i ξ_i = C Σ ξ_i, which cancels the penalty term C Σ ξ_i, and so it can be deleted and no slack variable is there anymore. For the maximal margin case the conditions will be:

    α_i = 0 ⟹ y_i(⟨w·x_i⟩ + b) ≥ 1
    α_i > 0 ⟹ y_i(⟨w·x_i⟩ + b) = 1    (5.22)

And last but not least, for the 2-norm case:

    α_i = 0 ⟹ y_i(⟨w·x_i⟩ + b) ≥ 1
    α_i > 0 ⟹ y_i(⟨w·x_i⟩ + b) = 1 − α_i/C    (5.23)

The last condition is reformulated by means of implicitly defining ξ_i with the help of the primal KT condition ∂L_P/∂ξ_i = Cξ_i − α_i = 0 of chapter 5.3.2, therefore ξ_i = α_i/C. And with the complementarity KT condition α_i[y_i(⟨w·x_i⟩ + b) − 1 + ξ_i] = 0 the last condition above is gained. As seen in the soft margin chapters, points for which the second equation of 5.21 holds are support vectors lying on one of the margin hyperplanes, and points for which the third one holds lie inside the margin and are therefore called margin errors. These KT conditions will be used later and prove to be important when implementing algorithms for computationally/numerically solving the problem of 5.20, because a point is an optimum of 5.20 if and only if the KT conditions are fulfilled and Q is positive semi-definite. The second requirement is proven above. After the training process (the solving of the quadratic optimization problem), having as a solution the vector α* and therefore the bias b*, the classification of unseen data z is performed by

    f(z, α*, b*) = sign(⟨w*·z⟩ + b*) = sign(Σ_{i=1}^{ℓ} α_i* y_i ⟨x_i·z⟩ + b*) = sign(Σ_{i∈SV} α_i* y_i ⟨x_i·z⟩ + b*)    (5.13)

where the x_i are the training points with their corresponding α_i* greater than zero (and upper bounded by C) and therefore support vectors. As one can think now, the question arising here is: why always classify new data by the use of the α_i*, and why not simply save the resulting weight vector w*? Sure, up to now it would be possible to do that, with no further need of storing the training points and their labels. But as seen above there will normally be very few support vectors, and only they, with their corresponding α_i* and y_i, are necessary to reconstruct w*. The main reason, however, will be given in chapter 6, where we will see that we must use the α_i* and cannot simply store w*. To give a short link to the implementation issue discussed later, it can be said that in most cases the 1-norm is used, because in real-world applications you will normally not have noise-free linearly separable data, and therefore the maximal margin approach will not lead to satisfactory results. But the main problem in practice is still the selection of the used feature data. The 2-norm is used in fewer cases because it is not easy to integrate into the SMO algorithm discussed in the implementation chapter.
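Since the KT conditions 5.21 serve as the optimality criterion in implementations, a pointwise check can be sketched as follows (a minimal illustration assuming NumPy; the tolerance handling is a common practical choice, not prescribed by the thesis):

```python
import numpy as np

def kt_violated(alpha, y, f, C, tol=1e-3):
    """Check the 1-norm KT conditions (5.21) pointwise, where f holds the
    current decision values <w, x_i> + b:
      alpha_i = 0      ->  y_i f_i >= 1
      0 < alpha_i < C  ->  y_i f_i  = 1
      alpha_i = C      ->  y_i f_i <= 1
    Returns a boolean mask marking the points that violate them."""
    m = y * f
    at_zero = (alpha <= tol) & (m < 1 - tol)
    at_C = (alpha >= C - tol) & (m > 1 + tol)
    inside = (alpha > tol) & (alpha < C - tol) & (np.abs(m - 1) > tol)
    return at_zero | at_C | inside
```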

Chapter 6 Nonlinear Classifiers

The last chapter showed how the linear classifiers can easily be computed by means of standard optimization techniques. But linear learning machines are restricted because of their limited computational power, as highlighted in the 1960s by Minsky and Papert. Summarized, it can be stated that real-world applications require more expressive hypothesis spaces than linear functions. Or in other words, the target concept may be too complex to be expressed as a simple linear combination of the given attributes (that is what linear machines do); equivalently: the decision function is not a linear function of the data. This problem can be overcome by the use of the so-called kernel technique. The general idea is to map the input data nonlinearly to a (nearly always higher dimensional) space and then separate it there by linear classifiers. This results in a nonlinear classifier in input space (see figure 6.1). Another solution to this problem has been proposed in the neural network theory: multiple layers of thresholded linear functions, which led to the development of multi-layer neural networks.

Figure 6.1: Simpler classification task by a feature map Φ. 2-dimensional input space on the left, 2-dimensional feature space on the right, where we are able to separate by a linear classifier, which leads to the nonlinear classifier in input space.

6.1 Explicit Mappings

Now the representation of training examples will be changed by mapping the data to a (possibly infinite dimensional) Hilbert space F. Usually the space F will have a much higher dimension than the input space X. The mapping Φ : X → F is applied to each labelled example before training, and then the optimal separating hyperplane is constructed in the space F:

    Φ : X ⊆ R^n → F, x = (x_1, ..., x_n) ↦ Φ(x) = (Φ_1(x), Φ_2(x), ..., Φ_N(x))    (6.1)

This is equivalent to mapping the whole input space X into F. The components of Φ(x) are called features, while the original quantities are sometimes referred to as the attributes. F is called the feature space. The task of choosing the most suitable representation of the data is known as feature selection. This can be a very difficult task, and there are different existing approaches to it. Frequently one seeks to identify the smallest set of features that still conveys the essential information contained in the original attributes. This is known as dimensionality reduction:

    x = (x_1, ..., x_n) ↦ Φ(x) = (Φ_1(x), ..., Φ_d(x)), d < n    (6.2)

and can be very beneficial, as both computational and generalization performance can degrade as the number of features grows, a phenomenon known as the curse of dimensionality. The difficulty one is facing with high dimensional feature spaces is that the larger the set of (probably redundant) features is, the more likely it is that the function to be learned could be represented using a standardised learning machine. Another approach to feature selection is the detection of irrelevant features and their elimination. As an example consider the gravitation law, which only uses information about the masses and the positions of two bodies. An irrelevant feature would be the colour or the temperature of the two bodies. As a last word on feature selection: it should be considered well as a part of the learning process. But it is also naturally a somewhat arbitrary step which needs some prior knowledge about the underlying target function. Therefore recent research has been done on techniques for feature reduction.

Footnote: A Hilbert space is a vector space with some more restrictions. A space H is separable if there exists a countable subset D ⊆ H such that every element of H is the limit of a sequence of elements of D. A Hilbert space is a complete separable inner product space. Finite dimensional vector spaces like R^n are Hilbert spaces. This space will be described in more detail a little further in this chapter; for further readings see [Ne].

However, in the rest of this diploma thesis we do not talk about the feature selection techniques because, as Cristianini and Shawe-Taylor proved in their book [Ne], we can afford to use infinite dimensional feature spaces and avoid computational problems by means of the implicit mapping described in the next chapter. So the curse of dimensionality can be said to be irrelevant by implicitly mapping the data, also known as the kernel trick. Before illustrating the mapping with an example, first notice that the only way in which data appears in the training problem is in the form of dot products ⟨x_i·x_j⟩. Now suppose this data is first mapped to some other (possibly infinite dimensional) space F using the mapping of 6.1: Φ : R^n → F. Then of course, as seen in 6.1 and 6.2, the training algorithm would only depend on the data through dot products in F, i.e. on functions of the form ⟨Φ(x_i)·Φ(x_j)⟩ (all other variables are scalars). Second, there is no vector w mapped to F via Φ, but we can write w in the form w = Σ α_i y_i Φ(x_i), and the whole hypothesis/decision function will be of the type f(x) = sign(⟨w·Φ(x)⟩ + b), or reformulated f(x) = sign(Σ α_i y_i ⟨Φ(x_i)·Φ(x)⟩ + b). So a support vector machine is constructed which lives in the new higher dimensional space F, but all the considerations of the former chapters still hold, since we are still doing a linear separation, but in a different space. But now a simple example with explicit mapping. Consider a given training set S of points in R¹ with class labels +1 and −1: S = {(−1, +1), (0, −1), (+1, +1)}. Trivially, these three points are not separable by a hyperplane, here a point in R¹ (see figure 6.2). So first the data is nonlinearly mapped to the R³ by applying

Footnote: Input dimension is n = 1, therefore the hyperplane is of dimension n − 1 = 0, and a zero-dimensional affine subspace is a single point.

    Φ : R¹ → R³, x ↦ Φ(x) = (x², √2·x, 1)

Figure 6.2: A non-separable example in the input space R¹. The hyperplane would be a single point, but it cannot separate the data points.

This step results in a training set consisting of the vectors Φ(x_i) with the corresponding labels +1, −1, +1. As illustrated in figure 6.3, the solution in the new space R³ can easily be seen geometrically in the (Φ₁, Φ₃)-plane (see figure 6.4). It is w = (1, 0, 0), which is almost normalised yet, meaning it has a length of 1, and the bias becomes b = −0.5 (a negative b means moving the hyperplane running through the origin into the positive direction). So it can be seen that the learning task can be easily solved in the R³ by linear separation. But how does the decision function look in the original space R¹, where we need it?

Remember that w can be written in the form w = Σ_i α_i y_i Φ(x_i).

Figure 6.3: Creation of a separating hyperplane, i.e. a plane, in the new space R³.

Figure 6.4: Looking at the (Φ₁, Φ₃)-plane, the solution to w and b can be easily given by geometric interpretation of the picture.

And in our particular example it can be written as w = Σ_{i=1}^{3} α_i y_i Φ(x_i). Worked out with α₁ = α₃ = ½ and α₂ = 1:

    w = ½·Φ(−1) − 1·Φ(0) + ½·Φ(+1) = (1, 0, 0)

The solving vector w is then (1, 0, 0). With the equation

    ⟨Φ(x)·Φ(z)⟩ = (⟨x·z⟩ + 1)²    (6.3)

the hyperplane then becomes, expressed with the original training points x_i in R¹:

    0 = ⟨w·Φ(z)⟩ + b = Σ_{i=1}^{3} α_i y_i ⟨Φ(x_i)·Φ(z)⟩ + b = Σ_{i=1}^{3} α_i y_i (⟨x_i·z⟩ + 1)² + b = z² − ½

This leads to the nonlinear "hyperplane" in R¹ consisting of two points: z₁ = −1/√2 and z₂ = +1/√2. As seen in equation 6.3, the inner product in the feature space has an equivalent function in the input space. Now we introduce an abbreviation for the dot product in feature space:

    K(x, z) := ⟨Φ(x)·Φ(z)⟩    (6.4)

Clearly, if the feature space is very high-dimensional (or even infinite dimensional), the right-hand side of 6.4 will be very expensive to compute.

The observation in 6.3, together with the problem described above, motivates a search for ways to evaluate inner products in feature space without making direct use of the feature space nor the mapping Φ. This approach leads to the terms kernel and kernel trick.

6.2 Implicit Mappings and the Kernel Trick

Definition 6.1 (Kernel Function) Given a mapping Φ : X → F from input space X to an inner product feature space¹³ F, we call the function K : X × X → R a kernel function if for all x, z ∈ X:

    K(x, z) = ⟨Φ(x)·Φ(z)⟩    (6.5)

The kernel function then behaves like an inner product in feature space, but can be evaluated as a function in input space. For example take the polynomial kernel K(x, z) = ⟨x·z⟩^d. Now assume we have got d = 2 and R² as original input space, so we get:

¹³ Inner product space: A vector space X is called an inner product space if there exists a bilinear map (linear in each argument) that for each two elements x, y ∈ X gives a real number denoted by ⟨x·y⟩, satisfying symmetry and positivity. E.g.: let X = R^n and let λ_1, ..., λ_n be fixed positive numbers. Then the following defines a valid inner product: ⟨x·y⟩ = Σ_i λ_i x_i y_i = x^T A y, where A is the n × n diagonal matrix with non-zero entries A_ii = λ_i.

    K(x, z) = ⟨x·z⟩² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂²
            = ⟨(x₁², √2·x₁x₂, x₂²)·(z₁², √2·z₁z₂, z₂²)⟩ = ⟨Φ(x)·Φ(z)⟩    (6.6)

with the feature map Φ(x) = (x₁², √2·x₁x₂, x₂²). So the data is mapped to the R³. But the second line can be left out by implicitly calculating ⟨Φ(x)·Φ(z)⟩ with the vectors in input space: K(x, z) = ⟨x·z⟩², which gives the same as in the above calculation first mapping the input vectors to the feature space and then calculating the dot product ⟨Φ(x)·Φ(z)⟩. So by implicitly mapping the input vectors to the feature space we are able to calculate the dot product there without even knowing the underlying mapping Φ! Summarized, such a non-linear mapping to a higher dimensional space can be performed implicitly and without increasing the number of parameters, because the kernel function computes the inner product in feature space only by use of the two inputs in input space.
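The identity 6.6 can be verified numerically; a tiny sketch (not from the thesis, assuming NumPy), where both routes give the same number:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map from (6.6) for 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = phi(x) @ phi(z)   # map first, then take the inner product in R^3
implicit = (x @ z) ** 2      # kernel trick: stay in input space
print(explicit, implicit)    # both print 16.0
```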

To generalize: a polynomial kernel K(x, z) = ⟨x·z⟩^d with degree d and attributes in an input space of dimension n maps the data to a feature space of dimension (n+d−1 choose d), where (n choose k) = n!/(k!(n−k)!) is called the binomial coefficient. In the example of 6.6 this means, with n = 2 and d = 2:

    (n+d−1 choose d) = (3 choose 2) = 3!/(2!·1!) = 3

And as can be seen above, the data is really mapped from the R² to the R³. In figure 6.5 the whole new procedure for classification of an unknown point z is shown, after training of the kernel-based SVM and therefore having the optimal weight vector (defined by the α_i*, the corresponding training points x_i and their labels y_i) and the bias b*.

Figure 6.5: The whole procedure for classification of a test vector z (in this example the test and training vectors are simple digits).

To stress the important facts: in contrast to the example in chapter 6.1, the chain of arguments is inverted. There we started by explicitly defining a mapping Φ before applying the learning algorithm. But now the starting point is choosing a kernel function K, which implicitly defines the mapping Φ, and therefore avoids the feature space in the computation of inner products as well as in the whole design of the learning machine itself. As seen above, both the learning and the test step only depend on the value of inner products in feature space. Hence, as shown, they can be formulated in terms of kernel functions. So once such a kernel function has been chosen, the decision function for unseen data z (5.13) becomes:

    f(z) = sign(Σ_{i∈SV} α_i* y_i ⟨Φ(x_i)·Φ(z)⟩ + b*) = sign(Σ_{i∈SV} α_i* y_i K(x_i, z) + b*)    (6.7)

And as said before, as a consequence we do not need to know the underlying feature map to be able to solve the learning task in feature space!

Remark: As remarked in chapter 5, the consequence of using kernels is that now the direct storing of the resulting weight vector is not practicable, because, as seen in 6.7 above, we would then have to know the mapping and could not use the advantage arising from the usage of kernels.

But which functions can be chosen as kernels?

6.2.1 Requirements for Kernels - Mercer's Condition

As a first requirement for a function to be chosen as a kernel, definition 6.5 gives two conditions, because the mapping has to go to an inner product feature space. So it can be easily seen that K has to be a symmetric function:

    K(x, z) = ⟨Φ(x)·Φ(z)⟩ = ⟨Φ(z)·Φ(x)⟩ = K(z, x)    (6.8)

And another condition for an inner product space is the Schwarz inequality:

    K(x, z)² = ⟨Φ(x)·Φ(z)⟩² ≤ ‖Φ(x)‖²·‖Φ(z)‖² = ⟨Φ(x)·Φ(x)⟩·⟨Φ(z)·Φ(z)⟩ = K(x, x)·K(z, z)    (6.9)

However, these conditions are not sufficient to guarantee the existence of a feature space. Here Mercer's theorem gives sufficient conditions (Vapnik 1995; Courant and Hilbert 1953). The following formulation of Mercer's theorem is given without proof, as it is stated in the paper of [Bur98].

Theorem 6.2 (Mercer's Theorem) There exist a mapping Φ and an expansion

    K(x, y) = Σ_i Φ_i(x)·Φ_i(y)

if and only if, for any g(x) such that

    ∫ g(x)² dx is finite,    (6.10)

it holds that

    ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0    (6.11)

Note: 6.11 has to hold for every g satisfying 6.10. This theorem is also sufficient for the infinite case. Another, simplified condition for K to be a kernel in the finite case can be obtained when describing K with its eigenvectors and eigenvalues. The proof is given in [Ne].

Proposition 6.3 Let X be a finite input space with K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix K = (K(x_i, x_j))_{i,j} is positive semi-definite.

Therefore Mercer's theorem is an extension of this proposition, based on the study of integral operator theory.

6.2.2 Making Kernels from Kernels

Theorem 6.2 is the basic tool for verifying that a function is a kernel. The remarked proposition 6.3 gives the requirement for a finite set of points. Now this criterion for a finite set is applied to confirm that a number of new kernels can be created. The next proposition of Cristianini and John Shawe-Taylor [Ne] allows creating more complicated kernels from simple building blocks:

Proposition 6.4 Let K₁ and K₂ be kernels over X × X, X ⊆ R^n, a ∈ R⁺, f(·) a real-valued function on X, Φ : X → R^m with K₃ a kernel over R^m × R^m, p(x) a polynomial with positive coefficients, and B a symmetric positive semi-definite n × n matrix. Then the following functions are kernels too:

    K(x, z) = K₁(x, z) + K₂(x, z)
    K(x, z) = a·K₁(x, z)
    K(x, z) = K₁(x, z)·K₂(x, z)
    K(x, z) = f(x)·f(z)
    K(x, z) = K₃(Φ(x), Φ(z))    (6.12)
    K(x, z) = x^T B z
    K(x, z) = p(K₁(x, z))
    K(x, z) = exp(K₁(x, z))

6.2.3 Some well-known Kernels

The selection of a kernel function is an important problem in applications, although there is no theory telling which kernel to use when. Moreover, it can be very difficult to check that some particular kernel satisfies Mercer's conditions, since they must hold for every g satisfying 6.10. In the following, some well known and widely used kernels are presented. Selection of the kernel (perhaps from among the presented ones) is usually based on experience and knowledge about the classification problem at hand, and also on theoretical considerations. The problem of choosing a kernel and its parameters on the basis of theoretical considerations will be discussed in chapter 7. Each kernel will be explained below.

    Polynomial: K(x, z) = (⟨x·z⟩ + c)^p    (6.13)
    Sigmoid: K(x, z) = tanh(κ⟨x·z⟩ + δ)    (6.14)
    Radial Basis Function (Gaussian kernel): K(x, z) = exp(−‖x − z‖² / (2σ²))    (6.15)

Polynomial Kernel: Here p gives the degree of the polynomial and c is some non-negative constant, usually c = 1. Usage of another, generalized inner product instead of the standard inner product above was proposed in many other works on SVMs, because the Hessian matrix can become zero in numerical calculations (this means no solution for the optimization problem). Then the kernel becomes:

    K(x, z) = (⟨x·z/σ⟩ + c)^p

where the vector σ is chosen such that the function satisfies Mercer's condition.
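For reference, the three kernels 6.13-6.15 in code form (a minimal sketch assuming NumPy; parameter names follow the text, default values are illustrative only):

```python
import numpy as np

def poly_kernel(x, z, p=2, c=1.0):
    """Polynomial kernel (6.13)."""
    return (x @ z + c) ** p

def sigmoid_kernel(x, z, kappa=1.0, delta=-1.0):
    """Sigmoid kernel (6.14); Mercer's condition only holds for
    certain values of kappa and delta."""
    return np.tanh(kappa * (x @ z) + delta)

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian / RBF kernel (6.15) with window width sigma."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))
```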

Figure 6.6: A polynomial kernel of degree 2 used for the classification of the XOR data set, which is non-separable in input space by a linear classifier. Each colour represents one class and the dashed lines mark the margins. The level of shading indicates the functional margin, or in other words: the darker the shading of one colour representing a specific class, the more confident the classifier is that a point in that region belongs to that class.

Sigmoid function: The sigmoid kernel stated above usually satisfies Mercer's condition only for certain values of the parameters κ and δ. This was noticed experimentally by Vapnik. Currently there are no theoretical results on the parameter values that satisfy Mercer's conditions. As stated in [Pan], the usage of the sigmoid kernel with the SVM can be regarded as a two-layer neural network. In such networks the input vector z is mapped by the first layer into the vector F = (F_1, ..., F_N), where F_i = tanh(κ⟨x_i·z⟩ + δ), i = 1, ..., N, and the dimension N of F is called the number of hidden units. In the second layer the sign of the weighted sum of the elements of F is calculated by using weights γ_i. Figure 6.7 illustrates that. The main difference to notice between SVMs and two-layer neural networks is the different optimization criterion: in the SVM case the goal is to find the optimal separating hyperplane which maximizes the margin (in the feature space), while in a two-layer neural network the criterion is usually to minimize the empirical risk associated with some loss function, typically the mean squared error.

Figure 6.7: A 2-layer neural network with N hidden units. The outputs of the first layer are of the form F_i = tanh(κ⟨x_i·z⟩ + δ), i = 1, ..., N, while the output of the whole network then becomes ŷ = sign(Σ_{i=1}^{N} γ_i F_i + b).

Another important notice should be given here: in neural networks the optimal network architecture is quite often unknown and mostly found only by experiments and/or prior knowledge, while in the SVM case such problems are avoided. Here the number of hidden units is the same as the number of support vectors, and the weights in the output layer γ are all determined automatically in the linearly separable case (in feature space).

Radial Basis Function (Gaussian): The Gaussian kernel is also known as the radial basis function. In the above function 6.15, σ² (the variance) defines a so-called window width (the width of the Gaussian). Sure, it is possible to have different window widths for different vectors, meaning to use a vector σ (see [Cha]). As some works show [Lin03], the RBF kernel will be a good starting point for a first try if one knows nearly nothing about the data to classify. The main reasons will be stated in the upcoming chapter 7, where the parameter selection will also be discussed.

Figure 6.8: An SVM with a Gaussian kernel (a fixed value of sigma σ) and with the application of the maximal margin case (C = inf) on an artificially generated training set.

Another remark to mention here is that up to now the algorithm, and so the above introduced classifiers, are only intended for the binary case. But as we will see in chapter 8, this can easily be extended to the multiclass case.

6.3 Summary

Kernels are a very powerful tool when dealing with nonlinearly separable datasets. The usage of the kernel trick has long been known and was therefore studied in detail. By its usage the problem to solve still stays the same as in the previous chapters, but the dot product in the formulas is rewritten using the implicit kernel mapping. So the problem can be stated as:

    Maximize L_D(w, b, ξ, α, β) = W(α) = Σ α_i − ½ ΣΣ α_i α_j y_i y_j K(x_i; x_j)
    subject to Σ α_i y_i = 0, 0 ≤ α_i ≤ C

with the same KT conditions as in the summary under 5.5.2. The overall decision function for some unseen data z then becomes:

    f(z) = sign(⟨w*·Φ(z)⟩ + b*) = sign(Σ_{i∈SV} α_i* y_i K(x_i; z) + b*)    (6.7)

Note: This kernel representation will be used from now on. To give the link to the linear case of chapter 5, where K(x_i; x_j) is replaced by ⟨x_i·x_j⟩, this kernel will be called the linear kernel.

Chapter 7 Model Selection

Though, as introduced in the last chapter, without building an own kernel based on the knowledge about the problem at hand, as a first try it is intuitive to use the four common and well known kernels. This approach is mainly used, as the examples in appendix A show. But as a first step there is the choice which kernel to use for the beginning. Afterwards the penalty parameter C and the kernel parameters have to be chosen too.

7.1 The RBF Kernel

As suggested in [Lin03], the RBF kernel is in general a reasonable first choice. Although, if the problem at hand is nearly the same as some already formidably solved ones (hand digit recognition, face recognition) which are documented in detail, a first try should be given to the kernels used there. But the parameters mostly have to be chosen in other ranges, applicable to the actual problem. Some examples of such already solved problems and links to further readings about them will be given in appendix A. As shown in the last chapter, the RBF kernel (as others) maps samples into a higher dimensional space, so it is able (in contrast to the linear kernel) to handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of the RBF one, as [Ke03] shows that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ)¹⁵. In addition, the sigmoid kernel behaves like RBF for certain parameters [Li03]. Another reason is the number of hyperparameters, which influence the complexity of model selection. The polynomial kernel has more of them than the RBF kernel.

¹⁵ γ = 1/(2σ²)

Finally, the RBF kernel has less numerical difficulties. One key point is 0 < K(x, z) ≤ 1, in contrast to polynomial kernels, whose kernel values may go towards infinity. Moreover, as said in the last chapter, the sigmoid kernel is not valid (i.e. not the inner product of two vectors) under some parameters.

7.2 Cross-Validation

In the case of RBF kernels there are two tuning parameters: C and σ. It is not known beforehand which values for them are the best for the problem at hand. So some parameter search must be done to identify the optimal ones. "Optimal ones" means finding C and σ so that the classifier can accurately predict unknown data (i.e. testing data) after training. In this respect it will not be useful to achieve high training accuracy at the cost of generalization ability. Therefore a common way is to separate the training data into two parts, of which one is considered unknown when training the classifier. Then the prediction accuracy on this set can more precisely reflect the performance on classifying unknown data. An improved version of this technique is known as cross-validation. In the so-called k-fold cross-validation the training set is divided into k subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining k−1 subsets. Thus each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data which are correctly classified. The main disadvantage of this procedure is its computational intensity, because the model has to be trained k times. A simpler technique can be extracted from this model by choosing k-fold cross-validation with k = ℓ. This means sequentially removing the i-th training point and training with the remaining ones. This procedure is known as Leave-One-Out (loo). Another technique is known as Grid Search. This approach has been chosen by [Lin03]. The main idea behind it is to basically try pairs of (C, γ), and the one with the best cross-validation accuracy is picked. Mentioned in this paper is the observation that trying exponentially growing sequences of C and γ is a practical way to find good parameters, e.g. C = 2⁻⁵, 2⁻³, ..., 2¹⁵; γ = 2⁻¹⁵, 2⁻¹³, ..., 2³. Sure, this search method is straightforward and stupid in some way. But as said in the paper above, there certainly are advanced techniques for grid-searching, but they all do an exhaustive parameter search by approximation or heuristics. Another reason is that it has been shown that the computational time to find good parameters by the original grid search is not much more than that of advanced methods, since there are still the same two parameters to be optimized.
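A minimal grid-search sketch over such exponential grids (assuming the scikit-learn library, which the thesis itself does not use; note it parametrizes the RBF width as gamma = 1/(2σ²), and the data here is a random placeholder):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.randn(100, 2)                     # placeholder data
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # placeholder +/-1 labels

param_grid = {
    'C':     [2.0**k for k in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
    'gamma': [2.0**k for k in range(-15, 4, 2)],   # 2^-15, ..., 2^3
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)  # 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)
```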

Chapter 8 Multiclass Classification

Up to now the study has been limited to the two-class case (called the binary case), where only two classes of data have to be separated. However, in real-world problems there are in general n > 2 classes to deal with. The training set still consists of pairs (x_i, y_i), where the x_i are input vectors as before, but now y_i ∈ {1, ..., n}, with i = 1, ..., ℓ. The first straightforward idea will be to reduce the multiclass problem to many two-class problems, so each resulting class is separated from the remaining ones.

8.1 One-Versus-Rest (OVR)

So, as mentioned above, the first idea for a procedure to construct a multiclass classifier is the construction of n two-class classifiers with the following decision functions:

    f_k(z) = sign(⟨w_k·z⟩ + b_k); k = 1, ..., n    (8.1)

This means that the classifier for class k separates this class from all other classes:

    f_k(z) = +1 if z belongs to class k, −1 otherwise

So the step-by-step procedure starts with class one: construct the first binary classifier for class 1 (positive) versus all others (negative), then class 2 versus all others, ..., class k = n versus all others. The resulting combined OVR decision function chooses the class for a sample that corresponds to the maximum value of the n binary decision functions, i.e. the furthest "positive" hyperplane. For clarification see figure 8.1 and table 8.1.

This whole first approach to gain a multiclass classifier is computationally very expensive, because there is need of solving n quadratic programming (QP) optimization problems of size ℓ (the training set size) now. As an example consider a three-class problem with linear kernel, introduced in figure 8.1. The OVR method yields a decision surface divided by three separating hyperplanes (the dashed lines). The shaded regions in the figure correspond to tie situations, where two or none of the classifiers are active (i.e. vote "positive") at the same time (also see table 8.1).

Figure 8.1: OVR applied to a three-class (A, B, C) example with linear kernel

Now consider the classification of a new unseen sample (hexagonal in figure 8.1) in the ambiguous region 3. This sample receives positive votes from both the A-class and the C-class binary classifiers. However, the distance of the sample from the A-class-vs.-all hyperplane is larger than from the C-class-vs.-all one. Hence the sample is classified to belong to the A class. In the same way the ambiguous region 7, with no votes, is handled. So the final combined OVR decision function results in the decision surface separated by the solid line in figure 8.1. Notice however that this final decision function significantly differs from the original one which corresponded to the solution of the k (here 3) QP optimization problems. The major drawback here is therefore that only three points (black dots in figure 8.1) of the resulting borderlines coincide with the original ones calculated by the n Support Vector Machines.
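The combined OVR rule described above can be sketched in a few lines (an illustration under the assumption that the n binary machines have already been trained; not the thesis' implementation):

```python
import numpy as np

def ovr_predict(z, hyperplanes):
    """Combined OVR decision: evaluate every real-valued output
    <w_k, z> + b_k from (8.1) and pick the class whose 'positive'
    hyperplane is furthest, i.e. the maximum value; this also resolves
    the tie regions of table 8.1. hyperplanes is a list of (w_k, b_k)."""
    scores = [w @ z + b for (w, b) in hyperplanes]
    return int(np.argmax(scores))
```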

So it seems that the benefits of maximal margin hyperplanes are lost. Summarized, it can be said that this is the simplest multiclass SVM method [Krs99 and Stat].

    Region | A vs. B and C | B vs. A and C | C vs. A and B | Resulting class
    1      | −             | B             | C             | ?
    2      | −             | −             | C             | C
    3      | A             | −             | C             | ?
    4      | A             | −             | −             | A
    5      | A             | B             | −             | ?
    6      | −             | B             | −             | B
    7      | −             | −             | −             | ?

Table 8.1: Three binary OVR classifiers applied to the corresponding example (figure 8.1). The column "Resulting class" contains the resulting classification of each region. Cells with "?" correspond to tie situations, where two or none of the classifiers are active at the same time. See the text for how ties are resolved.

8.2 One-Versus-One (OVO)

The idea behind this approach is to construct a decision function f_km : R^n → {−1, +1} for each pair of classes (k, m), k ≠ m, k, m = 1, ..., n:

    f_km(z) = +1 if z belongs to class k, −1 if z belongs to class m

So in total there are n(n−1)/2 pairs, because this technique involves the construction of the standard binary classifier for all pairs of classes. In other words, for every pair of classes a binary SVM is solved (with the underlying optimization problem of maximizing the margin). The decision function therefore assigns an instance to the class which has the largest number of votes after the sample has been tested against all n(n−1)/2 decision functions.

So the classification now involves n(n−1)/2 comparisons, and in each one the class to which the sample belongs in that binary case gets a +1 added to its number of votes ("Max Wins" strategy). Sure, there can still be tie situations. In such a case the sample will be assigned based on the classification provided by the furthest hyperplane, as in the OVR case [Krs99 and Stat]. As some researchers have proposed, this can be simplified by choosing the class with the lowest index when a tie occurs, because even then the results are still mostly accurate and approximated well enough [Lin03], without additional computation of distances. But this has to be verified for the problem at hand. The main benefit of this approach is that for every pair of classes the optimization problem to deal with is much smaller, i.e. in total there is only need of solving n(n−1)/2 QP problems of size smaller than the training set size, because only two classes are involved (and not the whole training set) in each problem, as in the OVR approach. Again consider the three-class example from the previous chapter. Using the OVO technique with a linear kernel, a decision surface is divided by three separate hyperplanes (dashed lines) obtained by the binary SVMs (see figure 8.2). The application of the "Max Wins" strategy (see table 8.2) results in the division of the decision surface into three regions separated by the thicker dashed lines and the small shaded ambiguous region in the middle. After the tie-breaking strategy from above (furthest hyperplane) is applied to the ambiguous region 7 in the middle, the final decision function becomes the solid black lines and the thicker dashed ones together. Notice here that the final decision function does not differ significantly from the original one corresponding to the solution of the n(n−1)/2 optimization problems. So the main advantage here, in contrast to the OVR technique, is the fact that the final borderlines are parts of the calculated pairwise decision functions, which was not the case in the OVR approach.
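The "Max Wins" voting, with the simplified lowest-index tie-break mentioned above, can be sketched as follows (an illustration assuming the pairwise machines are already trained; the data structure is hypothetical):

```python
import numpy as np

def ovo_predict(z, pair_classifiers, n_classes):
    """'Max Wins' voting over all n(n-1)/2 pairwise classifiers.
    pair_classifiers maps a pair (k, m), k < m, to a binary decision
    function returning +1 for class k and -1 for class m. Ties are
    broken by the lowest class index, the simplification noted above."""
    votes = np.zeros(n_classes, dtype=int)
    for (k, m), f in pair_classifiers.items():
        votes[k if f(z) >= 0 else m] += 1
    return int(np.argmax(votes))   # argmax returns the lowest index on ties
```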


Application of support vector machine in health monitoring of plate structures Appcaton of support vector machne n heath montorng of pate structures *Satsh Satpa 1), Yogesh Khandare ), Sauvk Banerjee 3) and Anrban Guha 4) 1), ), 4) Department of Mechanca Engneerng, Indan Insttute

More information

The University of Auckland, School of Engineering SCHOOL OF ENGINEERING REPORT 616 SUPPORT VECTOR MACHINES BASICS. written by.

The University of Auckland, School of Engineering SCHOOL OF ENGINEERING REPORT 616 SUPPORT VECTOR MACHINES BASICS. written by. The Unversty of Auckand, Schoo of Engneerng SCHOOL OF ENGINEERING REPORT 66 SUPPORT VECTOR MACHINES BASICS wrtten by Vojsav Kecman Schoo of Engneerng The Unversty of Auckand Apr, 004 Vojsav Kecman Copyrght,

More information

Boundary Value Problems. Lecture Objectives. Ch. 27

Boundary Value Problems. Lecture Objectives. Ch. 27 Boundar Vaue Probes Ch. 7 Lecture Obectves o understand the dfference between an nta vaue and boundar vaue ODE o be abe to understand when and how to app the shootng ethod and FD ethod. o understand what

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

A Tutorial on Data Reduction. Linear Discriminant Analysis (LDA) Shireen Elhabian and Aly A. Farag. University of Louisville, CVIP Lab September 2009

A Tutorial on Data Reduction. Linear Discriminant Analysis (LDA) Shireen Elhabian and Aly A. Farag. University of Louisville, CVIP Lab September 2009 A utoral on Data Reducton Lnear Dscrmnant Analss (LDA) hreen Elhaban and Al A Farag Unverst of Lousvlle, CVIP Lab eptember 009 Outlne LDA objectve Recall PCA No LDA LDA o Classes Counter eample LDA C Classes

More information

Recap: the SVM problem

Recap: the SVM problem Machne Learnng 0-70/5-78 78 Fall 0 Advanced topcs n Ma-Margn Margn Learnng Erc Xng Lecture 0 Noveber 0 Erc Xng @ CMU 006-00 Recap: the SVM proble We solve the follong constraned opt proble: a s.t. J 0

More information

Discriminative classifier: Logistic Regression. CS534-Machine Learning

Discriminative classifier: Logistic Regression. CS534-Machine Learning Dscrmnatve classfer: Logstc Regresson CS534-Machne Learnng robablstc Classfer Gven an nstance, hat does a probablstc classfer do dfferentl compared to, sa, perceptron? It does not drectl predct Instead,

More information

Multigradient for Neural Networks for Equalizers 1

Multigradient for Neural Networks for Equalizers 1 Multgradent for Neural Netorks for Equalzers 1 Chulhee ee, Jnook Go and Heeyoung Km Department of Electrcal and Electronc Engneerng Yonse Unversty 134 Shnchon-Dong, Seodaemun-Ku, Seoul 1-749, Korea ABSTRACT

More information

[WAVES] 1. Waves and wave forces. Definition of waves

[WAVES] 1. Waves and wave forces. Definition of waves 1. Waves and forces Defnton of s In the smuatons on ong-crested s are consdered. The drecton of these s (μ) s defned as sketched beow n the goba co-ordnate sstem: North West East South The eevaton can

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Solutions to Homework 7, Mathematics 1. 1 x. (arccos x) (arccos x) 1

Solutions to Homework 7, Mathematics 1. 1 x. (arccos x) (arccos x) 1 Solutons to Homework 7, Mathematcs 1 Problem 1: a Prove that arccos 1 1 for 1, 1. b* Startng from the defnton of the dervatve, prove that arccos + 1, arccos 1. Hnt: For arccos arccos π + 1, the defnton

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

LECTURE 21 Mohr s Method for Calculation of General Displacements. 1 The Reciprocal Theorem

LECTURE 21 Mohr s Method for Calculation of General Displacements. 1 The Reciprocal Theorem V. DEMENKO MECHANICS OF MATERIALS 05 LECTURE Mohr s Method for Cacuaton of Genera Dspacements The Recproca Theorem The recproca theorem s one of the genera theorems of strength of materas. It foows drect

More information

Hopfield Training Rules 1 N

Hopfield Training Rules 1 N Hopfeld Tranng Rules To memorse a sngle pattern Suppose e set the eghts thus - = p p here, s the eght beteen nodes & s the number of nodes n the netor p s the value requred for the -th node What ll the

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION CAPTER- INFORMATION MEASURE OF FUZZY MATRI AN FUZZY BINARY RELATION Introducton The basc concept of the fuzz matr theor s ver smple and can be appled to socal and natural stuatons A branch of fuzz matr

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Discriminative classifier: Logistic Regression. CS534-Machine Learning

Discriminative classifier: Logistic Regression. CS534-Machine Learning Dscrmnatve classfer: Logstc Regresson CS534-Machne Learnng 2 Logstc Regresson Gven tranng set D stc regresson learns the condtonal dstrbuton We ll assume onl to classes and a parametrc form for here s

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

A finite difference method for heat equation in the unbounded domain

A finite difference method for heat equation in the unbounded domain Internatona Conerence on Advanced ectronc Scence and Technoogy (AST 6) A nte derence method or heat equaton n the unbounded doman a Quan Zheng and Xn Zhao Coege o Scence North Chna nversty o Technoogy

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Strain Energy in Linear Elastic Solids

Strain Energy in Linear Elastic Solids Duke Unverst Department of Cv and Envronmenta Engneerng CEE 41L. Matr Structura Anass Fa, Henr P. Gavn Stran Energ n Lnear Eastc Sods Consder a force, F, apped gradua to a structure. Let D be the resutng

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? + + + + + + + + + Intuton of Margn Consder ponts

More information

Chapter 6. Rotations and Tensors

Chapter 6. Rotations and Tensors Vector Spaces n Physcs 8/6/5 Chapter 6. Rotatons and ensors here s a speca knd of near transformaton whch s used to transforms coordnates from one set of axes to another set of axes (wth the same orgn).

More information

Machine Learning. What is a good Decision Boundary? Support Vector Machines

Machine Learning. What is a good Decision Boundary? Support Vector Machines Machne Learnng 0-70/5 70/5-78 78 Sprng 200 Support Vector Machnes Erc Xng Lecture 7 March 5 200 Readng: Chap. 6&7 C.B book and lsted papers Erc Xng @ CMU 2006-200 What s a good Decson Boundar? Consder

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Intro to Visual Recognition

Intro to Visual Recognition CS 2770: Computer Vson Intro to Vsual Recognton Prof. Adrana Kovashka Unversty of Pttsburgh February 13, 2018 Plan for today What s recognton? a.k.a. classfcaton, categorzaton Support vector machnes Separable

More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

Multispectral Remote Sensing Image Classification Algorithm Based on Rough Set Theory

Multispectral Remote Sensing Image Classification Algorithm Based on Rough Set Theory Proceedngs of the 2009 IEEE Internatona Conference on Systems Man and Cybernetcs San Antono TX USA - October 2009 Mutspectra Remote Sensng Image Cassfcaton Agorthm Based on Rough Set Theory Yng Wang Xaoyun

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

On the Power Function of the Likelihood Ratio Test for MANOVA

On the Power Function of the Likelihood Ratio Test for MANOVA Journa of Mutvarate Anayss 8, 416 41 (00) do:10.1006/jmva.001.036 On the Power Functon of the Lkehood Rato Test for MANOVA Dua Kumar Bhaumk Unversty of South Aabama and Unversty of Inos at Chcago and Sanat

More information

Optimization. Nuno Vasconcelos ECE Department, UCSD

Optimization. Nuno Vasconcelos ECE Department, UCSD Optmzaton Nuno Vasconcelos ECE Department, UCSD Optmzaton many engneerng problems bol on to optmzaton goal: n mamum or mnmum o a uncton Denton: gven unctons, g,,...,k an h,,...m ene on some oman Ω R n

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

SVMs for regression Non-parametric/instance based classification method

SVMs for regression Non-parametric/instance based classification method S 75 Mchne ernng ecture Mos Huskrecht mos@cs.ptt.edu 539 Sennott Squre SVMs for regresson Non-prmetrc/nstnce sed cssfcton method S 75 Mchne ernng Soft-mrgn SVM Aos some fet on crossng the seprtng hperpne

More information

Duality in linear programming

Duality in linear programming MPRA Munch Personal RePEc Archve Dualty n lnear programmng Mhaela Albc and Dela Teselos and Raluca Prundeanu and Ionela Popa Unversty Constantn Brancoveanu Ramncu Valcea 7 January 00 Onlne at http://mpraubun-muenchende/986/

More information

Lower Bounding Procedures for the Single Allocation Hub Location Problem

Lower Bounding Procedures for the Single Allocation Hub Location Problem Lower Boundng Procedures for the Snge Aocaton Hub Locaton Probem Borzou Rostam 1,2 Chrstoph Buchhem 1,4 Fautät für Mathemat, TU Dortmund, Germany J. Faban Meer 1,3 Uwe Causen 1 Insttute of Transport Logstcs,

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Affine transformations and convexity

Affine transformations and convexity Affne transformatons and convexty The purpose of ths document s to prove some basc propertes of affne transformatons nvolvng convex sets. Here are a few onlne references for background nformaton: http://math.ucr.edu/

More information

Adaptive and Iterative Least Squares Support Vector Regression Based on Quadratic Renyi Entropy

Adaptive and Iterative Least Squares Support Vector Regression Based on Quadratic Renyi Entropy daptve and Iteratve Least Squares Support Vector Regresson Based on Quadratc Ren Entrop Jngqng Jang, Chu Song, Haan Zhao, Chunguo u,3 and Yanchun Lang Coege of Mathematcs and Computer Scence, Inner Mongoa

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

U.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017

U.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017 U.C. Berkeley CS94: Beyond Worst-Case Analyss Handout 4s Luca Trevsan September 5, 07 Summary of Lecture 4 In whch we ntroduce semdefnte programmng and apply t to Max Cut. Semdefnte Programmng Recall that

More information

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1 Abstract The Entre Souton Path for Support Vector Machne n Postve and Unabeed Cassfcaton 1 Yao Lmn, Tang Je, and L Juanz Department of Computer Scence, Tsnghua Unversty 1-308, FIT, Tsnghua Unversty, Bejng,

More information

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9 Chapter 9 Correlaton and Regresson 9. Correlaton Correlaton A correlaton s a relatonshp between two varables. The data can be represented b the ordered pars (, ) where s the ndependent (or eplanator) varable,

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem n-step cyce nequates: facets for contnuous n-mxng set and strong cuts for mut-modue capactated ot-szng probem Mansh Bansa and Kavash Kanfar Department of Industra and Systems Engneerng, Texas A&M Unversty,

More information

Classification learning II

Classification learning II Lecture 8 Classfcaton learnng II Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Logstc regresson model Defnes a lnear decson boundar Dscrmnant functons: g g g g here g z / e z f, g g - s a logstc functon

More information

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity

Week3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity Week3, Chapter 4 Moton n Two Dmensons Lecture Quz A partcle confned to moton along the x axs moves wth constant acceleraton from x =.0 m to x = 8.0 m durng a 1-s tme nterval. The velocty of the partcle

More information

COMPLEX NUMBERS AND QUADRATIC EQUATIONS

COMPLEX NUMBERS AND QUADRATIC EQUATIONS COMPLEX NUMBERS AND QUADRATIC EQUATIONS INTRODUCTION We know that x 0 for all x R e the square of a real number (whether postve, negatve or ero) s non-negatve Hence the equatons x, x, x + 7 0 etc are not

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

x yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting.

x yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting. The Practce of Statstcs, nd ed. Chapter 14 Inference for Regresson Introducton In chapter 3 we used a least-squares regresson lne (LSRL) to represent a lnear relatonshp etween two quanttatve explanator

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Generative classification models

Generative classification models CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn

More information