Support Vector Machines and Kernel Methods


Daniel Khashabi, Fall 2012. Last Update: September 26, 2016

1 Introduction

In Support Vector Machines the goal is to find a separator between the data which has the largest margin, as can be seen in Figure 1. Note that in the Perceptron algorithm the goal is just to find a separator for the data, although such a separator might not be a large-margin separator. In the following sections, we will start with the basic formulation of the SVM, and continue to the more advanced representations of the model.

2 Simple classification using Support Vector Machines

Suppose we choose a group of data points which could reasonably separate the information regions. These data points that lie close to the separation regions, selected among all the input data, are commonly called support vectors. Assume that we have a group of data {(x_i, y_i)}_{i=1}^n that can be separated by a hyperplane. Thus we can write the following statements about the separating hyperplanes:

    β·x_i + β_0 ≥ +1, if y_i = +1
    β·x_i + β_0 ≤ −1, if y_i = −1.

Equivalently, we could write the above separating equations as follows:

    y_i (β·x_i + β_0) ≥ 1, ∀i.

In the above formulation, β_0 is the bias weight. To continue with a simpler formulation, we make the following substitutions:

    β ← [β, β_0],    x_i ← [x_i, +1]

and the problem becomes

    y_i β·x_i ≥ 1, ∀i.

In this formulation, 1 is the (functional) size of the margin.
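As a concrete illustration of the bias-absorbing reformulation above, here is a minimal numpy sketch; the toy data and the hand-picked β are made up purely for illustration:

```python
import numpy as np

# Toy 2-D data with labels in {-1, +1}; values chosen so the classes are separable.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# The reformulation above: x_i <- [x_i, +1] folds the bias beta_0 into beta.
X_aug = np.hstack([X, np.ones((len(X), 1))])

# Any beta in R^3 now encodes an affine separator; this one is picked by hand.
beta = np.array([1.0, 1.0, 0.0])
print(y * (X_aug @ beta))  # all entries >= 1, so the margin constraints hold
```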

Figure 1: Max-margin scheme for support vector classification.

Instead of fixing the margin value at 1, we can optimize over it:

    max_{γ,β} γ
    subject to: y_i β·x_i ≥ ‖β‖ γ, ∀i.

Since only the direction of β is important, and to reduce the number of parameters, we can set γ = 1/‖β‖. Then the previous program can be written in the following form:

    min_β (1/2)‖β‖²                                  (1)
    subject to: y_i β·x_i ≥ 1, ∀i.

This is the optimality criterion for the separating hyperplane. The minimization criterion min_β (1/2)‖β‖² shrinks the coefficient vector β, and hence maximizes the margin, while preserving the separation constraints. (Using (1/2)‖β‖² instead of ‖β‖ is just for simplicity and ease of notation.)

One other interpretation of the above optimization criterion is as follows. Consider Figure 1, and two data points on the margin of each region, (x_1, y_1 = +1) and (x_2, y_2 = −1). For these two data points we have

    β·x_1 = +1, since y_1 = +1
    β·x_2 = −1, since y_2 = −1
    ⟹ β·(x_1 − x_2) = 2 ⟹ ‖x_1 − x_2‖ ≥ 2/‖β‖.

Here is a nice interpretation: in order to maximize the separating margin ‖x_1 − x_2‖ between the data points, it suffices to minimize ‖β‖, or equivalently to minimize ‖β‖².

The formulation in Equation 1 is called the hard-margin SVM, and it is the primal form. The objective is a quadratic function with linear constraints, and therefore we have a quadratic optimization (and hence a convex problem). Would it be enough to use a standard quadratic solver to solve the SVM problem? Indeed one can use a quadratic solver for SVM, but many early studies showed that, since the SVM problem is a special case of general quadratic programs, ad-hoc solutions to SVM usually give better and faster results than general solvers. Here we will derive multiple direct solutions to the problem.
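To make the "standard quadratic solver" route concrete, here is a minimal sketch of Equation 1 using the cvxpy modeling library on the toy data above; this is only an illustration, not a production SVM solver:

```python
import cvxpy as cp
import numpy as np

# Bias-augmented toy data from the sketch above.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -0.5, 1.0], [-2.0, -1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

beta = cp.Variable(X.shape[1])
objective = cp.Minimize(0.5 * cp.sum_squares(beta))   # (1/2) ||beta||^2
constraints = [cp.multiply(y, X @ beta) >= 1]         # y_i beta.x_i >= 1, for all i
cp.Problem(objective, constraints).solve()
print(beta.value)  # max-margin separator for the toy data
```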

One can optimize the constrained program in Equation 1 using Lagrange multipliers. First form the Lagrangian, with Lagrange multipliers {λ_i ≥ 0} added, as follows:

    L(β, λ) = (1/2)‖β‖² − Σ_i λ_i (y_i β·x_i − 1).                 (2)

Note that we wish to find a saddle point of L(β, λ):

    max_λ min_β L(β, λ).

The complementary slackness condition says, essentially,

    λ_i (y_i β·x_i − 1) = 0, ∀i.

In other words: if y_i β·x_i − 1 > 0, then λ_i = 0; and if λ_i > 0, then y_i β·x_i = 1. Such points are called support vectors. The above Lagrangian satisfies the necessary conditions

    ∇_β L = 0 ⟹ β = Σ_i λ_i y_i x_i,
    λ_i ≥ 0,
    y_i β·x_i − 1 ≥ 0,
    λ_i (y_i β·x_i − 1) = 0.

By substituting β back into the Lagrangian, one can find the dual problem, which essentially has the same solution as the main problem:

    L(β, λ) = Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i·x_j).

The full dual program is the following:

    max_λ Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i·x_j)
    subject to: Σ_i λ_i y_i = 0,  λ_i ≥ 0.                          (3)

Now it suffices to solve the dual problem for the λ_i, and find the coefficient vector β = Σ_i λ_i y_i x_i. For prediction on new points, we can now do sign(β·x).
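A minimal sketch of this dual route on the toy data (the small ridge added to the quadratic term is only for numerical stability in the solver's PSD check): solve the dual program (3) for λ, recover β = Σ_i λ_i y_i x_i, and predict with sign(β·x).

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -0.5, 1.0], [-2.0, -1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# G_ij = y_i y_j (x_i . x_j); tiny ridge keeps the matrix numerically PSD.
G = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(n)
lam = cp.Variable(n)
dual = cp.Problem(cp.Maximize(cp.sum(lam) - 0.5 * cp.quad_form(lam, G)),
                  [lam >= 0, y @ lam == 0])
dual.solve()

beta = (lam.value * y) @ X            # beta = sum_i lambda_i y_i x_i
print(np.round(lam.value, 4))         # nonzero entries mark the support vectors
print(np.sign(X @ beta))              # predictions: sign(beta . x)
```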

Now let's assume that we want to do classification on a non-linear space. Something important to notice is that the input variables enter the optimization only via inner products. We can use this fact and project the variables x into another space which has an inner product. More specifically, we define the function

    Φ : X → F.

Using this function, we replace the variable x with the new high-dimensional variable Φ(x). Now we define the notion of a kernel, which appears on different occasions and admits practical interpretations. We define a kernel as

    k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩.

To get more intuition into kernels, and to convince ourselves of the usefulness of this definition, let's go back and see the formulations based on the new feature space Φ(x). The dual formulation becomes

    max_λ Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (Φ(x_i)·Φ(x_j))
    subject to: Σ_i λ_i y_i = 0,  λ_i ≥ 0,

where Φ(x_i)·Φ(x_j) = k(x_i, x_j). The above formulation shows how, in the new formulation, the inner products of the variables appear only in conjunction with each other, which is what we call the kernel. Now we can interpret the prediction as a linear combination of kernels, defined by a subset of the input data:

    f(x) = sign(β·Φ(x)) = sign( Σ_i λ_i y_i (Φ(x_i)·Φ(x)) ) = sign( Σ_i λ_i y_i k(x_i, x) ).

Remark 1. The definition of the standard SVM has two important main points:
- the max-margin criterion;
- the projection of features into an arbitrary space (the kernel trick).

Example 1 (Gaussian kernel). The following is the definition of a Gaussian kernel:

    k_G(u, v) = exp( −‖u − v‖² / (2σ²) ) = ⟨φ(u), φ(v)⟩.

It can be shown that, for the above Gaussian kernel, the projection function φ(u) is of infinite dimension!

Example 2 (Polynomial kernel). The following is the definition of a polynomial kernel:

    k(u, v) = (1 + u·v)^d, for any d ≥ 1.                           (4)

Exercise 1. Given the kernels

    K(x, x′) = φ(x)·φ(x′),    K₁(x, x′) = φ₁(x)·φ₁(x′),

prove that:
- For any constant c ≥ 0, the constant function c is a valid kernel.
- For any constant c ≥ 0, cK is a valid kernel.
- K K₁ is a valid kernel.
- K + K₁ is a valid kernel.
- The polynomial kernel (defined in Equation 4) is a valid kernel.
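The two example kernels, and the finite-sample necessary condition underlying kernel validity (every Gram matrix must be symmetric positive semi-definite), can be checked numerically; a small sketch on made-up data:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # k_G(u, v) = exp(-||u - v||^2 / (2 sigma^2)); its feature map is infinite-dimensional.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(u, v, d=2):
    # k(u, v) = (1 + u.v)^d from Equation 4.
    return (1.0 + np.dot(u, v)) ** d

# Validity requires every Gram matrix to be symmetric PSD; Exercise 1 rests on this.
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 3))
for kernel in (gaussian_kernel, polynomial_kernel):
    K = np.array([[kernel(a, b) for b in sample] for a in sample])
    print(kernel.__name__, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```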

2.1 A simple guarantee

Here we give a simple error guarantee based on the number of support vectors. Suppose h_S is the hypothesis returned by some algorithm, learned on dataset S. The leave-one-out error of the algorithm on data S of size m is defined by averaging the error of the algorithm on instance x_i when it is trained on the rest of the instances, S \ {x_i}:

    R̂_loo = (1/m) Σ_{i=1}^m 1{ h_{S\{x_i}}(x_i) ≠ y_i }.

Lemma 1. The expected leave-one-out error on m instances is an unbiased estimate of the expected generalization error over m − 1 instances:

    E_{S∼D^m}[ R̂_loo ] = E_{S′∼D^{m−1}}[ R(h_{S′}) ].

Proof sketch. Distribute the expectation over the sum and decompose it into two independent expectations.

Lemma 2. Let h_S be the hypothesis returned by the SVM algorithm when trained on a dataset S, and let #SV(S) be the number of support vectors in this result. Then

    E_{S∼D^m}[ R(h_S) ] ≤ E_{S′∼D^{m+1}}[ #SV(S′) ] / (m + 1).

Proof sketch. If a point x is not a support vector, then h_S and h_{S\{x}} are the same; in other words, h_{S\{x}} will give a correct prediction on x. If a point x is a support vector, then h_{S\{x}} might make a mistake on x. Plugging these observations into the definition of the leave-one-out error, using the previous lemma, and taking expectations, we get the desired result.

3 Soft SVM

Instead of having hard margins, in many cases we may want to compromise a little, to get more generalization power. So we introduce slack variables ξ_i ≥ 0, which allow more flexibility on the separation margins:

    β·x_i ≥ +1 − ξ_i, if y_i = +1
    β·x_i ≤ −1 + ξ_i, if y_i = −1.

Also, we want to punish the algorithm whenever there is a non-zero slack:

    (1/2)‖β‖² + C Σ_i ξ_i.

In other words, we let the algorithm make a few mistakes, but pay for their cost. Similar to Equation 2, one could solve the above program. The same problem can be written in the following form:

    min_β (1/2)‖β‖² + C Σ_i max{0, 1 − y_i β·x_i}.                  (5)

Define the hinge loss to be

    φ(α) = (1 − α)_+ = max{0, 1 − α}.

One interpretation of this model is that we penalize margin violations with a hinge loss: as long as y_i β·x_i ≥ 1 the model is not penalized, and when y_i β·x_i < 1 it is penalized with weight C.
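Because Equation 5 is an unconstrained convex (though non-smooth) objective, plain subgradient descent already gives a working solver sketch; the step size and epoch count below are arbitrary illustrative choices (stochastic refinements of this idea, e.g. Pegasos, are standard):

```python
import numpy as np

def soft_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on (1/2)||beta||^2 + C sum_i max(0, 1 - y_i beta.x_i)."""
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        viol = y * (X @ beta) < 1               # points currently paying hinge loss
        grad = beta - C * (y[viol] @ X[viol])   # a subgradient of the objective
        beta -= lr * grad
    return beta

X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -0.5, 1.0], [-2.0, -1.5, 1.0]])  # bias-augmented toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
beta = soft_svm_subgradient(X, y)
print(beta, np.sign(X @ beta))
```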

Similar to the previous case, we can form the Lagrangian, form the dual, and find the updates of the model:

    L(β, ξ, λ, η) = (1/2)‖β‖² + C Σ_i ξ_i − Σ_i λ_i { y_i β·x_i − 1 + ξ_i } − Σ_i η_i ξ_i.

We first remove the primal variables from the above Lagrangian:

    ∇_β L = 0 ⟹ β = Σ_i λ_i y_i x_i,
    ∂L/∂ξ_i = 0 ⟹ λ_i + η_i = C.

By substituting the above equalities into the Lagrangian, we get the following:

    max_λ Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i·x_j)
    subject to: Σ_i λ_i y_i = 0,  λ_i ≥ 0,  η_i ≥ 0,  λ_i + η_i = C.

And we can easily eliminate η_i and end up with the following program:

    max_λ Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i·x_j)
    subject to: Σ_i λ_i y_i = 0,  0 ≤ λ_i ≤ C.                      (6)

How different is the dual of the soft-SVM (Equation 6) from the dual of the hard-SVM (Equation 3)? The only difference is that there is an upper bound C on the dual variables. The interpretation is that we cannot put too much weight on any single point.

Remark 2. If C is bigger than the biggest λ_i, then the soft-SVM is equivalent to the hard-SVM.

Remark 3. This form of SVM is usually known as C-SVM.

4 Kernels and Hilbert spaces

Theorem 1 (Mercer's theorem). Suppose K is a continuous symmetric non-negative definite kernel. Then there is a set of orthonormal basis functions {φ_i} ⊂ L²(X, P) consisting of eigenfunctions of T_K, i.e. T_K φ_j = λ_j φ_j, such that the corresponding sequence of eigenvalues {λ_j} is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on X, and K has the representation

    K(s, t) = Σ_j λ_j φ_j(s) φ_j(t),    s, t ∈ X,

where the convergence is absolute and uniform.
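Mercer's theorem has a finite-sample shadow: the Gram matrix of a valid kernel eigendecomposes with nonnegative eigenvalues, and summing λ_j φ_j(s) φ_j(t) over the empirical eigenvectors reconstructs the matrix. A small numpy sketch (an empirical analogue on made-up data, not the integral operator T_K itself):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.uniform(-1.0, 1.0, size=(50, 2))

# Gram matrix of a Gaussian kernel on the sample.
sq_dists = np.sum((sample[:, None, :] - sample[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# Empirical Mercer decomposition: K = sum_j lam_j phi_j phi_j^T.
lam, Phi = np.linalg.eigh(K)          # symmetric eigendecomposition
print("eigenvalues nonnegative:", lam.min() > -1e-10)
print("reconstruction error:", np.abs(K - (Phi * lam) @ Phi.T).max())
```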

4.1 Reproducing Kernel Hilbert Spaces (RKHS)

The reproducing property of an RKHS says that taking the inner product of any function f ∈ L_K(X) with K(x, ·) reproduces the value of the function at x:

    ⟨f, K(x, ·)⟩_K = ⟨ Σ_j c_j K(x_j, ·), K(x, ·) ⟩_K = Σ_j c_j ⟨K(x_j, ·), K(x, ·)⟩_K = Σ_j c_j K(x_j, x) = f(x).

Another representation of the RKHS is based on the eigenfunctions spanning the space of the kernel. Any function f ∈ L_K(X) can be represented as

    f(x) = Σ_i c_i K(x_i, x) = Σ_i c_i Σ_j λ_j φ_j(x_i) φ_j(x) = Σ_j ( λ_j Σ_i c_i φ_j(x_i) ) φ_j(x) = Σ_j d_j φ_j(x),

with d_j = λ_j Σ_i c_i φ_j(x_i).

Example 3. Let X be a compact (i.e. closed and bounded) subset of R^d, and let K : X × X → R be a Mercer kernel defined over X. With a fixed probability distribution P on X, consider the Hilbert space L²(X, P) of functions g : X → R such that ∫ g²(x) P(dx) < ∞, with the inner product defined as

    ⟨g, g′⟩ = ∫_X g(x) g′(x) P(dx) = E[ g(X) g′(X) ].

Also consider the operator T_K,

    [T_K φ](x) = ∫_X K(x, t) φ(t) P(dt),    x ∈ X,

which maps a function φ ∈ L²(X, P) to another function in L²(X, P). For a given kernel K, define L_K(X) to be the set of all functions f such that

    f(x) = Σ_j c_j K(x_j, x).

Using Mercer's theorem, prove that:

1. Let J = {j ∈ N : λ_j > 0}, and for each j ∈ J define the function ψ_j = √λ_j φ_j. Then {ψ_j}_{j∈J} is an orthonormal system in the RKHS H_K, i.e. ⟨ψ_j, ψ_k⟩_K = δ_{jk}, for all j, k ∈ J.

2. Let F be the unit ball of H_K, and let X_1, X_2, ..., X_n be drawn i.i.d. from P. Then

    E R̂_n(F(X_1^n)) ≤ √( Σ_j λ_j / n ).
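For finite expansions f = Σ_i c_i K(x_i, ·) and g = Σ_j d_j K(x_j, ·), the RKHS inner product reduces to Gram-matrix algebra, ⟨f, g⟩_K = cᵀ K d, so the reproducing property can be checked at the sample points; a small sketch with a Gaussian kernel and made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.normal(size=(10, 2))
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)           # Gram matrix K_ij = K(x_i, x_j)

c = rng.normal(size=10)               # f = sum_i c_i K(x_i, .)
k = 3                                 # check the reproducing property at x_k

inner = c @ K[:, k]                   # <f, K(x_k, .)>_K = c^T K e_k
f_at_xk = np.sum(c * K[:, k])         # direct evaluation f(x_k) = sum_i c_i K(x_i, x_k)
print(np.isclose(inner, f_at_xk))     # the two coincide: <f, K(x_k, .)>_K = f(x_k)
print("||f||_K^2 =", c @ K @ c)       # nonnegative, as a squared RKHS norm must be
```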

Figure 2: Decision Boundaries

5 Exercise Problems

Consider a dataset with 3 points in 1D, given as (label, input) pairs:

    {(+1, 0), (−1, −1), (−1, +1)}.

1. Are the classes ± linearly separable?

2. Consider mapping each point to 3D using the new feature vector Φ(x) = [1, √2·x, x²]. Are the classes now linearly separable? If so, find a separating hyperplane.

3. Consider the formulation of the soft-margin primal SVM, for given training data D = {(x_i, y_i) | x_i ∈ R^p, y_i ∈ {−1, 1}}, i = 1, ..., n:

    arg min_{w,ξ,b} { (1/2)‖w‖² + C Σ_i ξ_i }
    subject to: y_i (w·x_i − b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n.

Also recall the hard-margin primal SVM (the same program with ξ_i = 0, ∀i), and recall that we can derive the dual formulation and replace each x·x′ with a kernel function k(x, x′). Match each of the following with a decision boundary in Figure 2:

(a) A soft-margin linear SVM with C = 0.1.

(b) A soft-margin linear SVM with C = 10.

(c) A hard-margin kernel SVM with kernel k(u, v) = u·v + (u·v)².

(d) A hard-margin kernel SVM with kernel k(u, v) = exp( −(1/4)‖u − v‖² ).

(e) A hard-margin kernel SVM with kernel k(u, v) = exp( −4‖u − v‖² ).

4. Define a class variable y_i ∈ {−1, +1} which denotes the class of x_i, and let w = (w_1, w_2, w_3). The max-margin SVM classifier solves the following problem:

    arg min_{w,b} (1/2)‖w‖²
    subject to: y_i (w·Φ(x_i) − b) ≥ 1,  i = 1, ..., n.

Using the method of Lagrange multipliers, show that the solution is ŵ = (0, 0, −2), b̂ = −1, and the margin is 1/‖ŵ‖.

5. What happens if we change the constraints to

    y_i (w·Φ(x_i) − b) ≥ β, for some constant β > 0 (say β = 2)?

Solution:

1. No.

2. The points are mapped to (1, 0, 0), (1, −√2, 1), (1, √2, 1), respectively. The points are now separable in 3-dimensional space. A separating hyperplane is given by the weight vector (0, 0, 1) (with, e.g., threshold 1/2).
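A quick numerical check of Solutions 2 and 4 (the claimed ŵ and b̂ are the ones derived below):

```python
import numpy as np

# The 1-D dataset as (label, input) pairs: (+1, 0), (-1, -1), (-1, +1).
x = np.array([0.0, -1.0, 1.0])
y = np.array([1.0, -1.0, -1.0])
Phi = np.stack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2], axis=1)
print(Phi)  # rows: (1, 0, 0), (1, -sqrt(2), 1), (1, sqrt(2), 1)

# The claimed solution of problem 4.
w_hat, b_hat = np.array([0.0, 0.0, -2.0]), -1.0
print(y * (Phi @ w_hat - b_hat))  # all ones: every point is on the margin, so all
                                  # three are support vectors; margin = 1/||w_hat||
```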

4. First notice that all three points are support vectors. Therefore:

    arg min_{w,b} (1/2)‖w‖²  subject to  y_i (w·Φ(x_i) − b) ≥ 1,  i = 1, 2, 3,

    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1,2,3} α_i ( y_i (w·Φ(x_i) − b) − 1 ),
    ∇_w L = w − Σ_{i=1,2,3} α_i y_i Φ(x_i) = 0,
    ∂L/∂b = Σ_{i=1,2,3} α_i y_i = 0.

Component-wise, using Φ(x_1) = (1, 0, 0), Φ(x_2) = (1, −√2, 1), Φ(x_3) = (1, √2, 1):

    w_1 = α_1 − α_2 − α_3,
    w_2 = √2 (α_2 − α_3),
    w_3 = −(α_2 + α_3),
    α_1 − α_2 − α_3 = 0.

The first and last equations give w_1 = 0; by the symmetry of the problem α_2 = α_3, so w_2 = 0. Enforcing the active constraints y_i (w·Φ(x_i) − b) = 1 for all three support vectors then gives ŵ = (0, 0, −2) and b̂ = −1, which is the desired result.

5. Scaling the right-hand side of the constraints from 1 to β simply rescales the solution: (ŵ, b̂) is replaced by (βŵ, βb̂), so the decision boundary, and hence the geometric margin β/‖βŵ‖ = 1/‖ŵ‖, is unchanged.

6 Bibliographical notes

Some intuitions are from David Forsyth's and Feng Liang's classes at UIUC. Peter Bartlett's class notes provided a very good summary of the main points.

7 Some Answers

7.1 Answer to Example 3

7.1.1 First part

The answer is inspired by the formulation in [1]. Based on the definitions we have

    ⟨ψ_j, ψ_k⟩_K = ⟨ √λ_j φ_j, √λ_k φ_k ⟩_K = √(λ_j λ_k) ⟨φ_j, φ_k⟩_K.

Since φ_j = (1/λ_j) T_K φ_j, i.e. φ_j(x) = (1/λ_j) ∫_X K(x, t) φ_j(t) P(dt), we get

    ⟨ψ_j, ψ_k⟩_K = ( √(λ_j λ_k) / λ_j ) ∫_X φ_j(t) ⟨K(·, t), φ_k⟩_K P(dt)
                 = √(λ_k / λ_j) ∫_X φ_j(t) φ_k(t) P(dt)          (RKHS property: ⟨K(·, t), φ_k⟩_K = φ_k(t))
                 = √(λ_k / λ_j) ⟨φ_j, φ_k⟩_{L²(X,P)}
                 = √(λ_k / λ_j) δ_{jk}                            (the φ_j are orthonormal in L²(X, P))
                 = δ_{jk}.

7.1.2 Second part

We consider the ball of H_K:

    F_λ = { f ∈ H_K : ‖f‖_K ≤ λ }.

Here I am just reviewing the procedure introduced for bounding the (empirical) Rademacher complexity of this class:

    R̂_n(F_λ(X_1^n)) = E_σ [ sup_{f: ‖f‖_K ≤ λ} (1/n) Σ_i σ_i f(X_i) ]              (7)
                    = E_σ [ sup_{f: ‖f‖_K ≤ λ} (1/n) Σ_i σ_i ⟨f, K_{X_i}⟩_K ]      (8)
                    = E_σ [ sup_{f: ‖f‖_K ≤ λ} (1/n) ⟨f, Σ_i σ_i K_{X_i}⟩_K ]      (9)
                    ≤ (λ/n) E_σ ‖ Σ_i σ_i K_{X_i} ‖_K                              (10)
                    ≤ (λ/n) √( Σ_i ‖K_{X_i}‖²_K )                                  (11)
                    = (λ/n) √( Σ_i ⟨K_{X_i}, K_{X_i}⟩_K ).                         (12)

(Step (10) is Cauchy-Schwarz; step (11) uses Jensen's inequality together with E[σ_i σ_j] = δ_{ij}.)

Now we first simplify ⟨K_{X_i}, K_{X_i}⟩_K and plug the result into the above bound. But before that, we use the result found in the previous part. Previously we proved that ⟨ψ_i, ψ_j⟩_K = δ_{ij}; we can use this result:

    ⟨ψ_i, ψ_j⟩_K = √(λ_i λ_j) ⟨φ_i, φ_j⟩_K = δ_{ij}  ⟹  ⟨φ_i, φ_j⟩_K = δ_{ij} / λ_i.

Using this result, we simplify ⟨K_{X_i}, K_{X_i}⟩_K in Equation 12:

    Σ_i ⟨K_{X_i}, K_{X_i}⟩_K = Σ_i ⟨ Σ_j λ_j φ_j(X_i) φ_j, Σ_k λ_k φ_k(X_i) φ_k ⟩_K
                             = Σ_i Σ_{j,k} λ_j λ_k φ_j(X_i) φ_k(X_i) ⟨φ_j, φ_k⟩_K
                             = Σ_i Σ_{j,k} λ_j λ_k φ_j(X_i) φ_k(X_i) (δ_{jk} / λ_j)        (13)
                             = Σ_i Σ_j λ_j φ_j²(X_i).                                      (14)

Now we plug this result into the bound in Equation 7, with λ = 1 (the unit ball):

    R̂_n(F_1(X_1^n)) ≤ (1/n) √( Σ_i Σ_j λ_j φ_j²(X_i) ).

Now we take the expectation with respect to the samples:

    E R̂_n(F_1(X_1^n)) ≤ E [ (1/n) √( Σ_i Σ_j λ_j φ_j²(X_i) ) ]
                      ≤ (1/n) √( E Σ_i Σ_j λ_j φ_j²(X_i) )       (Jensen's inequality)
                      = (1/n) √( Σ_i Σ_j λ_j E[φ_j²(X_i)] )
                      = (1/n) √( n Σ_j λ_j )                      (E[φ_j²(X_i)] = 1 by orthonormality)
                      = √( Σ_j λ_j / n ),

which gives the desired result.

References

[1] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Number 24. Cambridge University Press, 2007.
