Homogenised Virtual Support Vector Machines
Christian J. Walder (1,2) and Brian C. Lovell (2)
1 Max Planck Institute for Biological Cybernetics, Spemannstraße 38, Tübingen, Germany
2 IRIS Research Group, EMI, School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Queensland 4072, Australia
christian.walder@tuebingen.mpg.de, lovell@itee.uq.edu.au

Abstract

In many domains, reliable a priori knowledge exists that may be used to improve classifier performance. For example, in handwritten digit recognition, such a priori knowledge may include classification invariance with respect to image translations and rotations. In this paper, we present a new generalisation of the Support Vector Machine (SVM) that aims to better incorporate this knowledge. The method is an extension of the Virtual SVM, and penalises an approximation of the variance of the decision function across each grouped set of virtual examples, thus utilising the fact that these groups should ideally be assigned similar class membership probabilities. The method is shown to be an efficient approximation of the invariant SVM of Chapelle and Schölkopf, with the advantage that it can be solved by trivial modification to standard SVM optimisation packages, with negligible increase in computational complexity when compared with the Virtual SVM. The efficacy of the method is demonstrated on a simple problem.

1 Introduction

In recent years Vapnik's Support Vector Machine (SVM) classifier [9] has become established among the most effective classifiers on real-world problems. As remarked in a comparison of classifiers applied to handwritten digit recognition written by LeCun et al. in 1995, SVMs are capable of achieving high generalisation performance with none of the a priori knowledge that is necessary in order to achieve similar results with other methods [4]. In the years since this comparison was conducted, SVM researchers have managed to incorporate problem-specific prior knowledge into the SVM algorithm. One very effective approach to date, as far as performance on the MNIST handwritten digit database is concerned, has been the Virtual SVM (V-SVM) method of Schölkopf et al. [5]. Indeed, to the best of our knowledge this method has achieved the lowest recorded test set error on the MNIST set, with trivial data preprocessing [3].

The V-SVM method involves first training a normal SVM in order to extract the support vector set. A set of transformations that are known to have no effect on the likelihood of class membership is then applied to these vectors, producing virtual support vectors. For example, in the case of handwritten digits, new virtual examples can be created by randomly shifting and rotating (in the 2D image sense) the original training digits. A new machine is then trained on the union of the original support vector set and the synthetic virtual support vectors.

In V-SVM, when the machine is retrained on the augmented training set, no distinction is made between the real and virtual data vectors. As such, a potentially useful piece of information about the problem is being discarded, namely that we know the decision (latent) function itself should be invariant to the transformations applied to the support vectors, not just the classifier output that corresponds to the sign of this latent function. Indeed, precisely this information has been incorporated in the Invariant SVM (I-SVM) method of Chapelle and Schölkopf [2], which has previously been implemented using ideas from kernel PCA [7]. The present work lies in between the I-SVM and V-SVM, in that the virtual examples are included as in V-SVM, but the I-SVM-like invariance of the latent function is imposed only on the support vectors.
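The V-SVM pipeline just described (train, extract support vectors, transform, retrain) is straightforward to sketch in code. The following is a minimal illustration only, not the authors' implementation: it uses scikit-learn, takes planar rotation as the label-preserving transformation, and the kernel and parameter settings are placeholder assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def rotate(X, theta):
    """Rotate 2-D points about the origin by theta radians."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return X @ R.T

def virtual_svm(X, y, theta=0.5, C=10.0):
    # 1. Train a normal SVM to extract the support vector set.
    base = SVC(kernel="poly", degree=2, C=C).fit(X, y)
    sv, sv_y = X[base.support_], y[base.support_]
    # 2. Apply a transformation known not to affect class membership,
    #    producing virtual support vectors with the same labels.
    virt = rotate(sv, theta)
    # 3. Retrain on the union of the real and virtual support vectors.
    X_aug, y_aug = np.vstack([sv, virt]), np.concatenate([sv_y, sv_y])
    return SVC(kernel="poly", degree=2, C=C).fit(X_aug, y_aug)
```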
As we shall see, however, the present approach represents a rather efficient approximation to the I-SVM, as it can be implemented by trivial modification to existing SVM optimisation software such as LIBSVM [1], with the same computational cost as the V-SVM.

The remainder of the paper is structured as follows: in Section 2 we briefly review the soft-margin SVM, before introducing the I-SVM in Section 3. In Section 4 the main contribution of the paper begins with the derivation of the necessary equations for the new approach. In Section 5 we demonstrate the efficacy of the new approach on a simple toy problem, showing improved performance as compared to the state-of-the-art, and rather hard to beat, V-SVM.

2 Soft-Margin Support Vector Machines

In normal squared-loss soft-margin SVM classification [9], we have a set of labelled points $x_i \in \mathbb{R}^d$, $i = 1 \dots N$, with associated class labels $y_i \in \{1, -1\}$. The standard SVM formulation solves the following problem:

$$\min_{w, b} \; \langle w, w \rangle + C \sum_{i=1}^{N} \xi_i^2 \quad \text{subject to} \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1 \dots N,$$

where $C$ is a regularisation parameter. In the Lagrangian dual of the above problem, the data vectors appear only by way of their inner products with one another. This allows the problem to be solved in a possibly infinite-dimensional feature space $\mathcal{H}$ by way of the kernel trick, i.e., the replacement of all inner products $\langle x_i, x_j \rangle$ by some Mercer kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $\phi : \mathbb{R}^d \to \mathcal{H}$ (see e.g. [8]). By dualising the above problem in this manner, we obtain the final decision function

$$g(x) = \operatorname{sgn}(f(x)), \quad \text{where } f(x) = \sum_{i=1}^{N} \alpha_i^0 y_i k(x, x_i) + b,$$

and the $\alpha_i^0$ are obtained by maximising the dual objective function

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j),$$

subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
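As an implementation aside (ours, not part of the paper): the dual above is a small quadratic program, and the squared slacks amount to adding $1/C$ to the diagonal of the Gram matrix, a fact the paper recovers in Section 4, where the matrix $B$ reduces to a diagonal with entries $1/C$. A minimal solver sketch using cvxopt:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_squared_loss(K, y, C=10.0):
    """Maximise sum(a) - 0.5 a'(H + I/C)a  s.t.  y'a = 0, a >= 0,
    where H_ij = y_i y_j K_ij; posed below as a cvxopt minimisation."""
    N = len(y)
    H = np.outer(y, y) * K
    P = matrix(H + np.eye(N) / C)         # quadratic term, with the 1/C ridge
    q = matrix(-np.ones(N))               # linear term: maximise sum(a)
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))              # a >= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)  # y'a = 0
    solvers.options["show_progress"] = False
    return np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
```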
3 Invariant Support Vector Machines

Previously, Chapelle and Schölkopf [6] incorporated domain-specific prior knowledge into the linear SVM framework by minimising the objective function

$$(1 - \gamma) \langle w, w \rangle + \gamma \sum_i \langle w, d x_i \rangle^2$$

subject to the usual SVM constraints (see Section 2). The tangent vectors $d x_i$ are directions in which a priori knowledge tells us that the functional value at $x_i$ should not change; the sense of the optimisation can be seen from the equality

$$f(x_i + d x_i) - f(x_i) = \langle w, d x_i \rangle.$$

It turns out that the above problem is equivalent to the normal SVM after linearly transforming the input space by $x \mapsto C_\gamma^{-1} x$, where the matrix $C_\gamma$ is defined by

$$C_\gamma = \left( (1 - \gamma) I + \gamma \sum_i d x_i \, d x_i^\top \right)^{\frac{1}{2}},$$

which was shown to be equivalent to using the following kernel function in the otherwise unchanged SVM framework:

$$k_\gamma(x_i, x_j) = x_i^\top C_\gamma^{-2} x_j. \quad (1)$$

However, in order to use the kernel trick to perform this algorithm in the feature space $\mathcal{H}$ induced by $k(\cdot, \cdot)$, as in the previous Section, one must replace the term $d x_i \, d x_i^\top$ with $d\phi(x_i) \, d\phi(x_i)^\top$ in the expression for $C_\gamma$. Since $\mathcal{H}$ is typically very high-dimensional, it is impossible to do this directly. A solution to this problem using kernel PCA [7] is given in [2], which takes advantage of the fact that the set $\{\phi(x_i)\}_{i = 1 \dots N}$ spans a subset of $\mathcal{H}$ whose dimension is no greater than $N$. Unfortunately, however, computing the modified kernel function in this manner does introduce computational disadvantages, and the need for a large-scale version of the algorithm was noted by its authors in [2].
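In the linear case, equation (1) can be evaluated directly; the sketch below (our illustration with numpy) builds the matrix under the square root and computes the modified kernel. Since $C_\gamma$ is defined as a matrix square root, $C_\gamma^{-2}$ is simply the inverse of the unrooted matrix, so no square root is actually required.

```python
import numpy as np

def isvm_linear_kernel(X, tangents, gamma):
    """Evaluate k_gamma(x_i, x_j) = x_i' C_gamma^{-2} x_j of equation (1).
    `tangents` is a list of tangent vectors dx_i; assumes gamma < 1."""
    d = X.shape[1]
    S = (1 - gamma) * np.eye(d) + gamma * sum(np.outer(dx, dx) for dx in tangents)
    return X @ np.linalg.inv(S) @ X.T    # C_gamma^{-2} = S^{-1}
```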
4 Homogenised Virtual SVM

In the I-SVM method of the previous Section, the tangent vectors $d x_i$ are usually not available directly, and so a finite-difference approximation is used. In this case the term $\langle w, d x_i \rangle^2$ can be written

$$\left( \langle w, x_i + \Delta x_i \rangle - \langle w, x_i \rangle \right)^2, \quad (2)$$

where the vector $x_i + \Delta x_i$ is equivalent to a virtual vector of the V-SVM approach that has been derived from $x_i$.

In the V-SVM method, the virtual vectors are derived from the real vectors by some group of invariant transformations, that is, transformations that should not affect the likelihood of class membership. For example, in handwritten digit recognition the group of one-pixel image translations has been used to good effect [3]; in this case, for each of the original training patterns an additional 8 virtual vectors can be derived, one for each possible single-pixel translation.

We will presently consider a combination of the I-SVM and the V-SVM, which we dub the Homogenised Virtual SVM (HV-SVM). To begin, we combine the inclusion of the virtual examples, as in the V-SVM method, with the invariance term (2), allowing as well a soft margin with parameter $C$ as in Section 2. Altogether this leads to the objective function

$$(1 - \gamma) \left( \langle w, w \rangle + C \sum_i \xi_i^2 \right) + \frac{\gamma}{2} \sum_{n=1}^{P} \sum_{i,j \in S_n} \left( \langle w, x_i \rangle - \langle w, x_j \rangle \right)^2, \quad (3)$$

where each of the $P$ sets $S_n$ contains the subscripts of those vectors that are invariant transformations of one another. This effectively extracts a finite-difference approximation of an invariant direction for each pair of vectors in the same set $S_n$. The factor $\frac{1}{2}$ is included for equivalence with the I-SVM, since here each invariant direction is effectively included twice. The objective function must be minimised subject to the normal SVM constraints of Section 2. It is well known in the SVM community that these constraints lead to the following optimality conditions:

$$\alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 + \xi_i \right] = 0,$$

which imply that those vectors with $\alpha_i > 0$ (the support vectors) satisfy $\xi_i = 1 - y_i \left( \langle w, x_i \rangle + b \right)$. This means that if $x_i$ and $x_j$ are support vectors belonging to the same set $S_n$, then $y_i = y_j \in \{1, -1\}$ and therefore $\left( \langle w, x_i \rangle - \langle w, x_j \rangle \right)^2 = \left( \xi_i - \xi_j \right)^2$. Assuming that all vectors are support vectors, the objective function (3) can therefore be rewritten as

$$(1 - \gamma) \left( \langle w, w \rangle + C \sum_i \xi_i^2 \right) + \frac{\gamma}{2} \sum_{n=1}^{P} \sum_{i,j \in S_n} \left( \xi_i - \xi_j \right)^2. \quad (4)$$

In reality, however, not all of the vectors will be support vectors, and thus we effectively have an approximation that becomes more ideal as the number of support vectors increases. At this point we should note that the V-SVM method usually retains only the support vectors from a preliminary training iteration before deriving and retraining with the virtual examples, and that by a similar process we can reasonably expect a high proportion of support vectors. Moreover, the approximation is roughly equivalent to minimising the invariance term of only those vectors that lie near the decision boundary (the support vectors), which also seems reasonable, since those are the vectors that are widely believed to contain the most important information.

We now find the Lagrangian dual of this approximation to the I-SVM. The algebra is less of a burden if we assume to begin with that all of the vectors are in the same set $S_1$. Dividing the resultant objective function by $(1 - \gamma)$ gives

$$\langle w, w \rangle + C \sum_i \xi_i^2 + \frac{\gamma}{2 (1 - \gamma)} \sum_{i,j} \left( \xi_i - \xi_j \right)^2$$

(the summations run over $i = 1 \dots N$, $j = 1 \dots N$, as shall be assumed for the remainder of this section). Note that

$$C \sum_i \xi_i^2 + \frac{\gamma}{2 (1 - \gamma)} \sum_{i,j} \left( \xi_i^2 + \xi_j^2 - 2 \xi_i \xi_j \right) = \left( C + \frac{N \gamma}{1 - \gamma} \right) \sum_i \xi_i^2 - \frac{\gamma}{1 - \gamma} \sum_{i,j} \xi_i \xi_j,$$

so that if we let $F = C + \frac{N \gamma}{1 - \gamma}$ and $G = -\frac{\gamma}{1 - \gamma}$, the objective function can be rewritten as

$$\langle w, w \rangle + F \sum_i \xi_i^2 + G \sum_{i,j} \xi_i \xi_j.$$
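The rearrangement above is easy to sanity-check numerically; the throwaway snippet below (ours) verifies the identity for random slack values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, gamma = 7, 10.0, 0.3
xi = rng.normal(size=N)

# Left-hand side: C * sum_i xi_i^2 + gamma/(2(1-gamma)) * sum_ij (xi_i - xi_j)^2
lhs = C * np.sum(xi**2) + gamma / (2 * (1 - gamma)) * sum(
    (xi[i] - xi[j]) ** 2 for i in range(N) for j in range(N))

# Right-hand side: F * sum_i xi_i^2 + G * sum_ij xi_i xi_j
F = C + N * gamma / (1 - gamma)
G = -gamma / (1 - gamma)
rhs = F * np.sum(xi**2) + G * np.sum(np.outer(xi, xi))

assert np.isclose(lhs, rhs)   # the two forms of the slack penalty agree
```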
The Lagrangian function, with Lagrange multipliers $\alpha_i$, is then

$$L(w, b, \xi, \alpha) = \frac{1}{2} \langle w, w \rangle + \frac{F}{2} \sum_i \xi_i^2 + \frac{G}{2} \sum_{i,j} \xi_i \xi_j - \sum_i \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) + \xi_i - 1 \right],$$

and the stationarity conditions are:

$$\partial L / \partial w = 0 = w - \sum_i \alpha_i y_i x_i, \quad (5)$$
$$\partial L / \partial b = 0 = \sum_i \alpha_i y_i, \quad (6)$$
$$\partial L / \partial \xi_i = 0 = F \xi_i + G \sum_j \xi_j - \alpha_i, \quad (7)$$

so $w$ has the usual support vector expansion $w = \sum_i \alpha_i y_i x_i$. Substituting (5) and (6) into the Lagrangian yields

$$L = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i - \sum_i \alpha_i \xi_i + \frac{F}{2} \sum_i \xi_i^2 + \frac{G}{2} \sum_{i,j} \xi_i \xi_j = -\frac{1}{2} \tilde{\alpha}^\top H \tilde{\alpha} + \tilde{e}^\top \tilde{\alpha} - \tilde{\alpha}^\top \tilde{\xi} + \frac{F}{2} \tilde{\xi}^\top \tilde{\xi} + \frac{G}{2} \tilde{\xi}^\top E \tilde{\xi},$$

where $H$ is the usual Gram matrix, $[H]_{i,j} = y_i y_j \langle x_i, x_j \rangle$, $E$ is a matrix of ones, and $I$ the identity matrix. Rewriting (7) in matrix form and left-multiplying by $\tilde{\xi}^\top$ implies that $F \tilde{\xi}^\top \tilde{\xi} + G \tilde{\xi}^\top E \tilde{\xi} - \tilde{\alpha}^\top \tilde{\xi} = 0$. The Lagrangian can therefore be written

$$L(w, b, \xi, \alpha) = -\frac{1}{2} \tilde{\alpha}^\top H \tilde{\alpha} + \tilde{e}^\top \tilde{\alpha} - \frac{1}{2} \tilde{\alpha}^\top \tilde{\xi}.$$

Finally, we substitute the expression for $\tilde{\xi}$ obtained from (7), $\tilde{\xi} = (F I + G E)^{-1} \tilde{\alpha}$, which leads to the final form of our dual objective function:

$$W(\alpha) = -\frac{1}{2} \tilde{\alpha}^\top \left( H + (F I + G E)^{-1} \right) \tilde{\alpha} + \tilde{e}^\top \tilde{\alpha}.$$

At this point one can show that the matrix $(F I + G E)^{-1}$ has entries equal to

$$\frac{F + (N - 1) G}{F (F + G N)} = \frac{C (1 - \gamma) + \gamma}{C \left( C (1 - \gamma) + N \gamma \right)}$$

on the diagonal, and

$$\frac{-G}{F (F + G N)} = \frac{\gamma}{C \left( C (1 - \gamma) + N \gamma \right)}$$

elsewhere. Using these expressions we can now compactly write down the final form of the dual problem for (4). In this case one can readily verify that the sets $S_n$ are treated similarly to the previous case, in which it was assumed that there is one single all-encompassing set $S_1 = \{1 \dots N\}$, resulting in a dual problem that consists of minimising the dual objective function

$$\frac{1}{2} \tilde{\alpha}^\top \left( H + B \right) \tilde{\alpha} - \tilde{e}^\top \tilde{\alpha}$$

subject to the normal SVM constraints (see Section 2). Referring to the $S_n$ as groups, the matrix $B$ is defined on the diagonal by

$$[B]_{i,i} = \begin{cases} \dfrac{C (1 - \gamma) + \gamma}{C \left( C (1 - \gamma) + N_i \gamma \right)} & \text{if } x_i \text{ is in a group} \\[2ex] \dfrac{1}{C} & \text{otherwise,} \end{cases} \quad (8)$$

where $N_i$ is the total number of vectors in $x_i$'s group. Off the diagonal, $B$ is defined by

$$[B]_{i,j} = \begin{cases} \dfrac{\gamma}{C \left( C (1 - \gamma) + N_i \gamma \right)} & \text{if } x_i \text{ and } x_j \text{ are in the same group} \\[2ex] 0 & \text{otherwise.} \end{cases} \quad (9)$$

Note that if $\gamma = 0$, or if there are no non-empty groups, then $B$ simplifies to a diagonal matrix with entries $\frac{1}{C}$, recovering the soft-margin SVM of Section 2. The similarity with the normal SVM allows the problem to be solved by trivial modification to existing SVM optimisation packages such as LIBSVM [1]: one need simply replace the typical diagonal bias terms with the $B$ term above.
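Implementing the HV-SVM therefore reduces to building $B$ from the group structure and handing $H + B$ to a standard dual solver, for example the QP sketch after Section 2 with $H + I/C$ replaced by $H + B$. A minimal construction of $B$ per equations (8) and (9) follows (again our sketch, not the authors' code):

```python
import numpy as np

def hvsvm_B(N_total, groups, C, gamma):
    """Build the HV-SVM bias matrix B of equations (8) and (9).
    `groups` is a list of index arrays (the sets S_n); vectors outside
    every group keep the plain soft-margin diagonal entry 1/C."""
    B = np.eye(N_total) / C
    for S in groups:
        Ni = len(S)
        denom = C * (C * (1 - gamma) + Ni * gamma)
        diag = (C * (1 - gamma) + gamma) / denom
        off = gamma / denom
        for i in S:
            for j in S:
                B[i, j] = diag if i == j else off
    return B
```

At $\gamma = 0$ the group entries collapse to $1/C$ on the diagonal and zero elsewhere, recovering the V-SVM as noted above.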
Hybrid HV/I-SVM: Note that it is also possible to derive a hybrid method that is closer to the I-SVM, in the sense that it ignores only the invariance terms of those entire sets $S_n$ that contain no support vectors. If $V_n$ are the support vectors, and $\bar{V}_n$ the non-support vectors, in group $S_n$, the invariance term of the HV-SVM method for that group is

$$\frac{1}{2} \sum_{i,j \in V_n} \left( \langle w, x_i \rangle - \langle w, x_j \rangle \right)^2 + |\bar{V}_n| \sum_{i \in V_n} \xi_i^2,$$

which, assuming that we are aiming for the same result as the I-SVM, is not what is required unless $\bar{V}_n$ is empty. However, if we had known $\bar{V}_n$ a priori and set $[B]_{i,j} = 0$ whenever $i \in \bar{V}_n$ or $j \in \bar{V}_n$, then the second term in the above invariance term would equal zero. Thus, had we also modified the kernel function as per the I-SVM method, with invariance directions $\{ x_i - x_j : i \in \bar{V}_n \text{ or } j \in \bar{V}_n \}$, then we would have established once again the correct I-SVM invariance term for $S_n$. This would need to have been done for all the $S_n$, but as we do not know the $\bar{V}_n$ a priori, it would be necessary to use, for example, the approach of Algorithm 1, below, for determining which invariance directions need to be accounted for by the I-SVM kernel function, rather than by the more efficient HV-SVM bias terms $[B]_{i,j}$. In the algorithm, the invariance terms associated with non-support vectors are ignored. Note that it should usually be possible to perform the retraining with fewer iterations than the initial pass, by using the previous solution as the starting point for the optimisation.

Algorithm 1: HV-SVM/I-SVM Hybrid
1: $W \leftarrow \emptyset$
2: $[B]_{i,j} \leftarrow 0, \; \forall i, j$
3: for all $\{i, j : i \notin W \wedge j \notin W\}$ do
4:   assign $[B]_{i,j}$ according to equations (8) and (9)
5: end for
6: Define $k_\gamma$ by equation (1), using the tangent vectors associated with $W$
7: Optimise using $k_\gamma$ and $B$, and let $V$ be the resultant support vector set
8: $R \leftarrow \bigcup_{n : S_n \cap V \neq \emptyset} S_n$
9: if $R \setminus V \subseteq W$ then
10:   finish
11: else
12:   $W \leftarrow W \cup (R \setminus V)$
13:   go to line 2
14: end if

Figure 1. An illustration of the effect of the parameter $\gamma$ in separating pluses from circles in 2D. An invariance to rotations is encoded by choosing virtual vectors that are rotations of the training data, each invariant group $S_n$ being indicated by a dotted line. The grey level indicates the decision function value, and the white line the decision boundary. Left: Virtual SVM; Middle: HV-SVM with medium $\gamma$; Right: HV-SVM with $\gamma$ close to one.

Figure 2. Cross-validation performance for the circle toy problem, averaged over 100 trials. The grey scale indicates mean test error percentage. The $C$ scale is logarithmic, and the $\gamma$ scale is logarithmic with the exception of the first value, which is zero. Note that $\gamma = 0$ is equivalent to the Virtual SVM method, and that the invariances are enforced more as $\gamma \to 1$. Very little performance change is found by extending the plot to greater $C$ values.

5 Experiments

Our tests on real-world problems are the subject of ongoing work. In the present section we instead consider the following toy problem: the data are uniformly distributed over the interval $[-1, 1]^2$, and the true decision boundary is a circle centred at the origin, namely $f(x) = \operatorname{sgn}(\| x \| - 1/2)$. The invariance we wish to encode is local invariance under rotations, and so we derive from each training vector a single virtual vector by applying a rotation of 0.5 radians about the origin. A training set of 15 points and a test set of 2000 points were generated independently in 100 individual trials. For the kernel function we chose the second-order polynomial kernel $k(\tilde{x}, \tilde{y}) = \langle \tilde{x}, \tilde{y} \rangle^2$. The results of the experiment are shown in Figure 2, which indicate that, for any given value of the margin softness parameter $C$, the more the invariance is enforced ($\gamma \to 1$), the better the test set performance becomes. Moreover, it can be seen from the example in Figure 1 that, as expected, the invariance term leads to decision boundaries of a more circular nature.
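One trial of this setup takes only a few lines; the sketch below (ours) generates the training data, the rotated virtual vectors, the groups $S_n$ consumed by the $B$ construction above, and the quadratic kernel matrix.

```python
import numpy as np

def circle_trial(n_train=15, theta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_train, 2))
    y = np.sign(np.linalg.norm(X, axis=1) - 0.5)    # f(x) = sgn(||x|| - 1/2)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X_aug = np.vstack([X, X @ R.T])                 # one virtual vector each
    y_aug = np.concatenate([y, y])                  # rotation preserves ||x||
    groups = [np.array([i, i + n_train]) for i in range(n_train)]  # sets S_n
    K = (X_aug @ X_aug.T) ** 2                      # k(x, y) = <x, y>^2
    return X_aug, y_aug, groups, K
```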
6 Conclusion

We have presented a generalisation of the Virtual SVM in which the support vectors that are invariant transformations of one another are constrained to have similar decision function values. We have demonstrated that the method is superior to the state-of-the-art Virtual SVM on a toy problem involving invariances under rotations (see Figure 2). Moreover, the algorithm is easy to implement, as the final optimisation problem is very similar to that of the standard SVM, and it incurs no additional computational penalty in comparison with the Virtual SVM. We plan to perform more real-world tests; since the Virtual SVM represents the state of the art in digit recognition, it seems interesting to apply the method to this problem.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[2] O. Chapelle and B. Schölkopf. Incorporating invariances in nonlinear support vector machines.
[3] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161-190, 2002.
[4] Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition, 1995.
[5] B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, volume 1112 of Springer Lecture Notes in Computer Science, pages 47-52, Berlin, 1996.
[6] B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In Proceedings of the 1997 conference on Advances in Neural Information Processing Systems 10. MIT Press, 1998.
[7] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem, 1998.
[8] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[9] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.