Lecture 6: Support Vector Machines


1 Lecture 6: Support Vector Machines
Marina Meilă, Department of Statistics, University of Washington. November 2018

2 Linear SVMs
The margin and the expected classification error
Maximum Margin Linear classifiers
Linear classifiers for non-linearly separable data
Non-linear SVM: the kernel trick, kernels, prediction with SVM
Extensions: L1 SVM, Multi-class and One-class SVM, SV Regression
Reading: HTF Ch. , Murphy Ch. 14 (14.1 kernels, 14.4 and equations (14.28, 14.29) kernel trick, Support Vector Machines)
Additional Reading: C. Burges, A tutorial on SVM for pattern recognition
These notes: the Appendices (convex optimization) are optional.

3 A VC bound
L_01(θ) ≤ L̂_01(θ) + sqrt( (h[1 + log(2N/h)] + log(4/δ)) / N ), w.p. > 1 − δ (1)
where the second term is denoted R(h), h = VCdim F, and δ < 1 is the confidence.
A linear classifier is denoted f(x; w, b) = w^T x + b, where x takes the label sign(f(x; w, b)). The margin of f on data point x_i is, as usual, y_i f(x_i; w, b).

4 The margin and the expected classification error
The following two theorems suggest that a large margin is a predictor of good generalization error.
Theorem. Let F_ρ be the class of hyperplanes f(x) = w^T x, x, w ∈ R^n, that are at least ρ away from any data point in the training set D [1]. Then
VCdim F_ρ ≤ 1 + min( n, R_D^2 / ρ^2 ) (2)
where R_D is the radius of the smallest ball that encloses the dataset.
Theorem. Let F = { sign(w^T x), ||w|| ≤ Λ, ||x|| ≤ R } and let ρ > 0 be any margin. Then for any f ∈ F, w.p. 1 − δ over training sets,
L_01(f) ≤ L̂_ρ + sqrt( (c/N) ( (R^2 Λ^2 / ρ^2) ln N^2 + ln(1/δ) ) ) (3)
where c is a universal constant and L̂_ρ is the fraction of the training examples for which
y_i w^T x_i < ρ. (4)
A data point that satisfies (4) for some ρ is called a margin error. For ρ = 0 the margin error rate L̂_ρ is equal to L̂_01. Note that ρ has a different meaning in the two Theorems above.
[1] In other words, a set D is shattered only if all the linear classifiers pass at least ρ away from its points.

5 Maximum Margin Linear classifiers
Support Vector Machines appeared from the convergence of Three Good Ideas.
Assume (for the moment) that the data are linearly separable. Then there is an infinity of linear classifiers that have L̂_01 = 0. Which one to choose?
First idea: Select the classifier that has maximum margin ρ on the training set.
By SRM, we should choose the (w, b) parameters that minimize L̂(w, b) + R(h_{w,b}), where h_{w,b} is given by (2):
For any parameters (w, b) that perfectly classify the data, L̂(w, b) = 0. Among these, the best (w, b) is the one that minimizes R(h_{w,b}).
R(h) increases with h, and h_{w,b} decreases when ρ increases.
Hence, by SRM we should choose
argmax_{ρ, w, b : L̂(w,b)=0} ρ, s.t. d(x_i, H_{w,b}) ≥ ρ for i = 1 : N, (5)
where d() denotes the Euclidean distance and H_{w,b} = { x | w^T x + b = 0 } is the decision boundary of the linear classifier. Because d(x_i, H_{w,b}) = |w^T x_i + b| / ||w|| (proof in a few slides), (5) becomes
argmax_{ρ, w, b : L̂(w,b)=0} ρ, s.t. |w^T x_i + b| / ||w|| ≥ ρ for i = 1 : N. (6)

6 Maximum Margin Linear classifiers
We continue to transform (6). If all data are correctly classified, then y_i (w^T x_i + b) = |w^T x_i + b|. Therefore (6) has the same solution as
argmax_{ρ, w, b} ρ, s.t. y_i (w^T x_i + b) / ||w|| ≥ ρ for i = 1 : N. (7)
Note now that problem (7) is underdetermined: setting w ← Cw, b ← Cb with C > 0 does not change anything. We add a cleverly chosen constraint to remove the indeterminacy, namely ||w|| = 1/ρ, which allows us to eliminate the variable ρ. We get
argmax_{w, b} 1/||w||, s.t. y_i (w^T x_i + b) ≥ 1 for i = 1 : N. (8)
Note: the successive problems (5), (6), (7), ... are equivalent in the sense that their optimal solution is the same.

7 Alternative derivation of (8)
First idea: Select the classifier that has maximum margin on the training set, by the alternative definition of margin. Formally, define min_{i=1:N} y_i f(x_i) to be the margin of classifier f on D. Let f(x) = w^T x + b, and choose w, b that
maximize over w ∈ R^n, b ∈ R the quantity min_{i=1:N} y_i (w^T x_i + b).
Remarks: (if the data are linearly separable) there exist classifiers with margins > 0, and one can arbitrarily increase the margin of such a classifier by multiplying w and b by a positive constant. Hence, we need to normalize the set of candidate classifiers by requiring instead
maximize min_{i=1:N} d(x_i, H_{w,b}), s.t. y_i (w^T x_i + b) ≥ 1 for i = 1 : N, (9)
where d() denotes the Euclidean distance and H_{w,b} = { x | w^T x + b = 0 } is the decision boundary of the linear classifier. Under the conditions of (9), because there are points for which y_i (w^T x_i + b) = 1, maximizing d(x_i, H_{w,b}) over w, b for such a point is the same as
max_{w,b} 1/||w||, s.t. min_i y_i (w^T x_i + b) = 1. (10)

8 Second idea
The Second Idea is to formulate (8) as a quadratic optimization problem:
min_{w,b} (1/2) ||w||^2 s.t. y_i (w^T x_i + b) ≥ 1 for all i = 1 : N. (11)
This is the Linear SVM (primal) optimization problem. It has a strongly convex objective ||w||^2 and constraints y_i (w^T x_i + b) ≥ 1 that are linear in (w, b). Hence this is a convex problem, and it can be studied with the tools of convex optimization.
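As a quick illustration (not part of the original slides), problem (11) can be handed directly to a generic convex solver. A minimal sketch in Python, assuming cvxpy is available and using a small made-up separable dataset:

import numpy as np
import cvxpy as cp

# Toy linearly separable data (an assumption for illustration only).
X = np.array([[-2.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, n = X.shape

w = cp.Variable(n)
b = cp.Variable()
# Primal (11): minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print("w =", w.value, " b =", b.value)

At the optimum, prob.value equals (1/2)||w*||^2, so 1/||w*|| is the achieved geometric margin.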

9 The distance of a point x to a hyperplane H_{w,b}
d(x, H_{w,b}) = |w^T x + b| / ||w|| (12)
Intuition: denote w̄ = w/||w||, b̄ = b/||w||, x̄ = w̄^T x. (13)
Obviously H_{w,b} = H_{w̄,b̄}, and x̄ is the length of the projection of point x on the direction of w. The distance is measured along the normal through x to H; note that if x̄ = −b̄ then x ∈ H_{w,b} and d(x, H_{w,b}) = 0; in general, the distance along this line will be x̄ − (−b̄).

10 Optimization with Lagrange multipliers [2]
The Lagrangean of (11) is
L(w, b, α) = (1/2) ||w||^2 − Σ_i α_i [ y_i (w^T x_i + b) − 1 ]. (14)
[KKT conditions] At the optimum of (11)
w = Σ_i α_i y_i x_i with α_i ≥ 0 (15)
and b = y_i − w^T x_i for any i with α_i > 0.
A support vector is a data point x_i such that α_i > 0. According to (15), the final decision boundary is determined by the support vectors (i.e. it does not depend explicitly on any data point that is not a support vector).
[2] The derivations of these results are in the Appendix.

11 Dual SVM optimization problem
Any convex optimization problem has a dual problem. For the SVM, it is both illuminating and practical to solve the dual problem. The dual to problem (11) is
max_{α_{1:N}} Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j s.t. α_i ≥ 0 for all i and Σ_i α_i y_i = 0. (16)
This is a quadratic problem with N variables on a convex domain.
Dual problem in matrix form: Denote α = [α_i]_{i=1:N}, y = [y_i]_{i=1:N}, G_{ij} = x_i^T x_j, Ḡ_{ij} = y_i y_j x_i^T x_j, G = [G_{ij}] ∈ R^{N×N}, Ḡ = [Ḡ_{ij}] ∈ R^{N×N}. Then
max_{α ∈ R^N} 1^T α − (1/2) α^T Ḡ α s.t. α ≥ 0 and y^T α = 0. (17)
g(α) = 1^T α − (1/2) α^T Ḡ α is the dual objective function. G is called the Gram matrix of the data. Note that Ḡ = diag(y) G diag(y).
At the dual optimum, α_i > 0 for constraints that are satisfied with equality, i.e. tight; α_i = 0 for the slack constraints.
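For concreteness (again not from the slides), here is a sketch of solving (17) with cvxpy on assumed toy data. The quadratic term α^T Ḡ α is written as ||Σ_i α_i y_i x_i||^2, which is the same quantity by the definition of Ḡ and (15):

import numpy as np
import cvxpy as cp

# Assumed toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
N = len(y)

alpha = cp.Variable(N)
# Dual (17): maximize 1^T alpha - (1/2) alpha^T Gbar alpha, alpha >= 0, y^T alpha = 0,
# where alpha^T Gbar alpha = || sum_i alpha_i y_i x_i ||^2.
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                 # eq. (15): w = sum_i alpha_i y_i x_i
i_sv = int(np.argmax(a))        # any point with alpha_i > 0 is a support vector
b = y[i_sv] - X[i_sv] @ w
print("support vectors:", np.sum(a > 1e-6), " w =", w, " b =", b)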

12 Non-linearly separable problems and their duals
The C-SVM:
minimize_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξ_i (18)
s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0.
In the above, the ξ_i are the slack variables.
Dual [3]: two types of SV
maximize_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j (19)
s.t. C ≥ α_i ≥ 0 for all i, Σ_i α_i y_i = 0.
α_i < C: data point x_i is on the margin, y_i (w^T x_i + b) = 1 (original SV)
α_i = C: data point x_i cannot be classified with margin 1 (margin error), y_i (w^T x_i + b) < 1
[3] Lagrangean L(w, b, ξ, α, µ) = (1/2) ||w||^2 + C Σ_i ξ_i − Σ_i α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_i µ_i ξ_i with α_i ≥ 0, ξ_i ≥ 0, µ_i ≥ 0.
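In practice the C-SVM dual (19) is solved by specialized software rather than by hand. A short sketch using scikit-learn (an assumed dependency), on made-up overlapping data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0)     # C is the slack penalty in (18)
clf.fit(X, y)
# Support vectors are the points with alpha_i > 0; alpha_i = C marks margin errors.
print("support vectors per class:", clf.n_support_)
print("w =", clf.coef_, " b =", clf.intercept_)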

13 The ν-SVM
minimize_{w,b,ξ,ρ} (1/2) ||w||^2 − νρ + (1/N) Σ_i ξ_i (20)
s.t. y_i (w^T x_i + b) ≥ ρ − ξ_i (21)
ξ_i ≥ 0 (22)
ρ ≥ 0 (23)
where ν ∈ [0, 1] is a parameter.
Dual [4]:
maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j (24)
s.t. 1/N ≥ α_i ≥ 0 for all i (25)
Σ_i α_i y_i = 0 (26)
Σ_i α_i ≥ ν (27)
Properties. If ρ > 0 then:
ν is an upper bound on (#margin errors)/N (if Σ_i α_i = ν)
ν is a lower bound on #(original support vectors + margin errors)/N
ν-SVM leads to the same w, b as C-SVM with C = 1/ν.
[4] Lagrangean L(w, b, ξ, ρ, α, µ, δ) = (1/2) ||w||^2 − νρ + (1/N) Σ_i ξ_i − Σ_i α_i [ y_i (w^T x_i + b) − ρ + ξ_i ] − Σ_i µ_i ξ_i − δρ with α_i ≥ 0, δ ≥ 0, µ_i ≥ 0.
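A companion sketch for the ν-parameterization, using scikit-learn's NuSVC (assumed dependency); it illustrates the property that ν lower-bounds the fraction of support vectors:

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="linear").fit(X, y)
    frac_sv = clf.support_.size / len(y)
    print(f"nu = {nu:.1f}: fraction of support vectors = {frac_sv:.2f} (>= nu, as the bound predicts)")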

14 A simple error bound
E[ L_01(f_N) ] ≤ E[ #support vectors of f_{N+1} ] / (N + 1) (28)
where f_N denotes the SVM trained on a sample of size N.
Exercise: Use Homework 5 Problem 3 to prove this result.

15 Non-linear SVM
How to use a linear classifier on data that is not linearly separable? An old trick:
1. Map the data x_{1:N} to a higher dimensional space: x → z = φ(x) ∈ H, with dim H >> n.
2. Construct a linear classifier w^T z + b for the data in H.
In other words, we are implementing the non-linear classifier
f(x) = w^T φ(x) + b = w_1 φ_1(x) + w_2 φ_2(x) + ... + w_m φ_m(x) + b. (29)

16 Example
The data {(x_i, y_i)} shown on the slide are not linearly separable. We map them to 3 dimensions by
z = φ(x) = [ x_1, x_2, x_1 x_2 ].
Now the classes can be separated by the hyperplane z_3 = 0 (which happens to be the maximum margin hyperplane). Hence w = [0, 0, 1] (a vector in H), b = 0, and the classification rule is f(φ(x)) = w^T φ(x) + b. If we write f as a function of the original x we get a quadratic classifier: f(x) = x_1 x_2.
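A numerical version of this example, with assumed XOR-type coordinates since the slide's data table did not survive the transcription; the point is only that the mapped data become separable by z_3 = 0:

import numpy as np

# Assumed XOR-type data; label = sign(x1 * x2).
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

def phi(x):
    # Feature map from the slide: z = [x1, x2, x1*x2]
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([phi(x) for x in X])
w, b = np.array([0.0, 0.0, 1.0]), 0.0     # the hyperplane z3 = 0 in H
print("margins y_i (w^T phi(x_i) + b):", y * (Z @ w + b))   # all positive => separated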

17 Non-linear SV problem
Primal problem: minimize (1/2) ||w||^2 s.t. y_i (w^T φ(x_i) + b) − 1 ≥ 0 for all i.
Dual problem:
max_{α_{1:N}} Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)^T φ(x_j) s.t. α_i ≥ 0 for all i and Σ_i y_i α_i = 0. (30)
Ḡ has been redefined in terms of φ:
G_{ij} = φ(x_i)^T φ(x_j) and Ḡ_{ij} = y_i y_j G_{ij}. (31)
Dual problem: same as (17)!
max_α 1^T α − (1/2) α^T Ḡ α s.t. α ≥ 0, y^T α = 0. (32)

18 The Kernel Trick
Third idea: The result (32) is the celebrated kernel trick of the SVM literature. We can make the following remarks.
1. The φ vectors enter the SVM optimization problem only through the Gram matrix, thus only as the scalar products φ(x_i)^T φ(x_j). We denote by K(x, x') the function
K(x, x') = K(x', x) = φ(x)^T φ(x'). (33)
K is called the kernel function. If K can be computed efficiently, then the Gram matrix G can also be computed efficiently. This is exactly what one does in practice: we choose φ implicitly by choosing a kernel K. Hereby we also ensure that K can be computed efficiently.
2. Once G is obtained, the SVM optimization is independent of the dimension of x and of the dimension of z = φ(x). The complexity of the SVM optimization depends only on N, the number of examples. This means that we can choose a very high dimensional φ without any penalty on the optimization cost.
3. Classifying a new point x. As we know, the SVM classification rule is
f(x) = w^T φ(x) + b = Σ_{i=1}^N α_i y_i φ(x_i)^T φ(x) + b = Σ_{i=1}^N α_i y_i K(x_i, x) + b. (34)
Hence, the classification rule is expressed in terms of the support vectors and the kernel only. No operations other than scalar products are performed in the high dimensional space H.
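To make remark 1 concrete, here is a small check (not from the slides) that a kernel value equals a scalar product of feature vectors; the homogeneous quadratic kernel K(x, x') = (x^T x')^2 in R^2 and its explicit map φ are chosen purely for illustration:

import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous quadratic kernel in R^2
    # (an illustrative choice, not from the slides): K(x, x') = (x^T x')^2.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
lhs = (x @ xp) ** 2        # kernel evaluated directly, O(n) work
rhs = phi(x) @ phi(xp)     # same value via the explicit map, O(dim H) work
print(lhs, rhs)            # agree up to rounding error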

19 Kernels
The previous section shows why SVMs are often called kernel machines. If we choose a kernel, we have all the benefits of a mapping in high dimensions, without ever carrying out any operations in that high dimensional space. The most usual kernel functions are:
K(x, x') = (1 + x^T x')^p, the polynomial kernel of degree p
K(x, x') = tanh(σ x^T x' − β), the neural network kernel
K(x, x') = exp(−||x − x'||^2 / σ^2), the Gaussian or radial basis function (RBF) kernel; its φ is ∞-dimensional
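These kernels are available in standard libraries. A sketch with scikit-learn's pairwise kernels (assumed dependency); note that its parameterization differs slightly from the notation above (gamma plays the role of 1/σ^2 for the RBF kernel, and the polynomial kernel is (gamma x^T x' + coef0)^degree):

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.random.default_rng(0).normal(size=(5, 3))

# Polynomial kernel (1 + x^T x')^p with p = 3 (gamma=1, coef0=1 match the form above).
G_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)
# RBF kernel exp(-||x - x'||^2 / sigma^2); sklearn's gamma corresponds to 1/sigma^2.
G_rbf = rbf_kernel(X, gamma=1.0)
print(G_poly.shape, G_rbf.shape)   # both are N x N Gram matrices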

20 The Mercer condition
How do we verify that a chosen K is a valid kernel, i.e. that there exists a φ so that K(x, x') = φ(x)^T φ(x')? This property is ensured by a positivity condition known as the Mercer condition.
Mercer condition: Let (X, µ) be a finite measure space. A symmetric function K : X × X → R can be written in the form K(x, x') = φ(x)^T φ(x') for some φ : X → H ⊆ R^m iff
∫_{X^2} K(x, x') g(x) g(x') dµ(x) dµ(x') ≥ 0 for all g such that ||g||_{L2} < ∞. (35)
In other words, K must be a positive semidefinite operator on L_2. If K satisfies the Mercer condition, there is no guarantee that the corresponding φ is unique, or that it is finite-dimensional.
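On a finite sample, the practical counterpart of (35) is that every Gram matrix built from K must be positive semidefinite. A minimal numerical check (an illustration, assuming numpy), comparing the Gaussian kernel with a deliberately invalid choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

K_rbf = np.exp(-sq)        # Gaussian kernel Gram matrix: PSD, min eigenvalue >= 0
K_bad = -sq                # "negative squared distance" is not a Mercer kernel
print("RBF min eigenvalue:", min_eig(K_rbf))   # approximately >= 0 (up to rounding)
print("bad min eigenvalue:", min_eig(K_bad))   # clearly negative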

21 Quadratic kernel
Figure: C-SVM with a polynomial degree 2 kernel, N = 200. The two ellipses show that a constant shift to the data (x → x + v, v ∈ R^n) can affect non-linear kernel classifiers.

22 RBF kernel and Support Vectors

23 Prediction with SVM
Estimating b: For any support vector i, w^T x_i + b = y_i because the classification constraint is tight. Alternatively, if there are slack variables, w^T x_i + b = y_i (1 − ξ_i). Hence
b = y_i (1 − ξ_i) − w^T x_i.
For the non-linear SVM, where w is not known explicitly, w = Σ_j α_j y_j φ(x_j). Hence
b = y_i (1 − ξ_i) − Σ_{j=1}^N α_j y_j K(x_i, x_j) for any support vector i.
Given a new x,
ŷ(x) = sign(w^T x + b) = sign( Σ_{i=1}^N α_i y_i K(x_i, x) + b ). (36)
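The quantities in (36) are exactly what fitted kernel-SVM implementations expose. A sketch with scikit-learn's SVC (assumed dependency), whose dual_coef_ stores α_i y_i for the support vectors and intercept_ stores b; the hand-built rule matches the library's decision function:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(30, 2)), rng.normal(+1, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = np.array([[0.3, -0.2]])
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)   # K(x, x_i) for support vectors only
f = K @ clf.dual_coef_.ravel() + clf.intercept_          # eq. (36): sum_i alpha_i y_i K(x_i, x) + b
print(f, clf.decision_function(x_new))                   # the two agree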

24 L1-SVM
If the regularization ||w||^2, based on the l_2 norm, is replaced with the l_1 norm ||w||_1, we obtain what is known as the Linear L1-SVM:
min_{w,b} ||w||_1 + C Σ_i ξ_i s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0 for all i = 1 : N. (37)
The use of the l_1 norm promotes sparsity in the entries of w.
The Non-linear L1-SVM uses the classifier
f(x) = Σ_i (α_i^+ − α_i^−) y_i K(x_i, x) + b (38)
min_{α^±, b} Σ_i (α_i^+ + α_i^−) + C Σ_i ξ_i s.t. y_i f(x_i) ≥ 1 − ξ_i, ξ_i, α_i^± ≥ 0 for all i = 1 : N. (39)
This formulation enforces α_i^+ = 0 or α_i^− = 0 for all i. If we set w_i = α_i^+ − α_i^−, we can write f(x) = Σ_i w_i y_i K(x_i, x) + b, a linear classifier in the non-linear features K(x_i, x).
The L1-SVM problems are Linear Programs. The dual L1-SVM problems are also linear programs. The L1-SVM is no longer a Maximum Margin classifier.
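Since (37) is a linear program, a generic convex/LP solver handles it directly. A sketch of the linear L1-SVM with cvxpy (assumed dependency), letting the solver deal with ||w||_1:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(40, 5)), rng.normal(+1, 1, size=(40, 5))])
y = np.array([-1.0] * 40 + [+1.0] * 40)
N, n = X.shape
C = 1.0

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(N)
# L1-SVM (37): minimize ||w||_1 + C * sum(xi), hinge-type constraints with slack xi.
objective = cp.Minimize(cp.norm1(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print("nonzero entries of w:", int(np.sum(np.abs(w.value) > 1e-6)), "out of", n)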

25 Multi-class and One-class SVM
Multiclass SVM: For a problem with K possible classes, we construct K separating hyperplanes w_r^T x + b_r = 0.
minimize (1/2) Σ_{r=1}^K ||w_r||^2 + C Σ_{i,r} ξ_{i,r} (40)
s.t. w_{y_i}^T x_i + b_{y_i} ≥ w_r^T x_i + b_r + 1 − ξ_{i,r} for all i = 1 : N, r ≠ y_i (41)
ξ_{i,r} ≥ 0 (42)
One-class SVM: This SVM finds the support regions of the data, by separating the data from the origin by a hyperplane. It is mostly used with the Gaussian kernel, which projects the data on the unit sphere. The formulation below is identical to the ν-SVM where all points have label 1.
minimize (1/2) ||w||^2 − νρ + (1/N) Σ_i ξ_i (43)
s.t. w^T x_i + b ≥ ρ − ξ_i (44)
ξ_i ≥ 0 (45)
ρ ≥ 0 (46)
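The one-class formulation (43)-(46) is implemented, with a kernel, as OneClassSVM in scikit-learn (assumed dependency). A sketch that estimates the support of a Gaussian cloud and flags a far-away point:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # "normal" data
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])    # one inlier, one obvious outlier

oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
print(oc.predict(X_test))    # +1 for points inside the estimated support, -1 outside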

26 SV Regression
The idea is to construct a tolerance interval of ±ε around the regressor f and to penalize data points for being outside this tolerance margin. In words, we try to construct the smoothest function that goes within ε of the data points.
minimize (1/2) ||w||^2 + C Σ_i (ξ_i^+ + ξ_i^−) (47)
s.t. −ε − ξ_i^− ≤ w^T x_i + b − y_i ≤ ε + ξ_i^+ (48)
ξ_i^± ≥ 0 (49)
The above problem is a linear regression, but with the kernel trick we obtain a kernel regressor of the form f(x) = Σ_i (α_i^− − α_i^+) K(x_i, x) + b.
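The ε-insensitive regression above corresponds to SVR in scikit-learn (assumed dependency). A minimal sketch on noisy sine data; the support vectors are the points on or outside the ±ε tube:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# epsilon is the half-width of the tolerance tube; C penalizes points outside it.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors (points on or outside the tube):", reg.support_.size)
print("prediction at pi/2:", reg.predict([[np.pi / 2]]))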

27 Convex optimization in a nutshell
A set D ⊆ R^n is convex iff for every two points x_1, x_2 ∈ D the line segment defined by x = t x_1 + (1 − t) x_2, t ∈ [0, 1], is also in D. A function f : D → R is convex iff, for any x_1, x_2 ∈ D and for any t ∈ [0, 1] for which t x_1 + (1 − t) x_2 ∈ D, the following inequality holds:
f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2). (51)
If f is convex, then the set { x | f(x) ≤ c } is convex for any value of c. Convex functions defined on convex sets have very interesting properties which have engendered the field called convex optimization. The optimization problem
min_x f_0(x) s.t. f_i(x) ≤ 0 for i = 1, ..., m (52)
is a convex optimization problem if all the functions f_0, f_i are convex. Note that in this case the feasible domain A = { x | f_i(x) ≤ 0 for all i } is a convex set.

28 It is known that if A has a non-empty interior then the convex optimization problem has at most one optimum x*. If A is also bounded, x* always exists. Assuming that x* exists, there are two possible cases: (1) The unconstrained minimum of f_0 lies in A. In this case, the optimum can be found by solving the equations ∂f_0/∂x = 0. (2) The unconstrained minimum of f_0 lies outside A. Figure 1 depicts what happens at the optimum x* in this case.

29 Figure: (a) One-constraint optimization: level curves f(x) = c, the constraint g(x) = 0, and the gradients of f and g at the optimum. (b) Four-constraint optimization with constraints g_1 = 0, ..., g_4 = 0; at the optimum only constraints g_1, g_4 are active. f denotes the objective (f_0 in the text) and the g_i denote the constraints (f_i in the text).
Assume there is only one constraint f_1. The domain A is the inside of the curve f_1(x) = 0. The optimum x* is the point where a level curve f_0(x) = c is tangent to f_1 = 0 from the outside. At this point, the gradients of the two curves lie along the same line, pointing in opposite directions. Therefore, we can write ∂f_0/∂x = −α ∂f_1/∂x. Equivalently, at x* we have ∂f_0/∂x + α ∂f_1/∂x = 0. Note that this is a necessary but not a sufficient condition. The above set of equations represents the Karush-Kuhn-Tucker optimality conditions (KKT).

30 With more than one constraint, the KKT conditions are equivalent to requiring that the gradient of f_0 lies in the subspace spanned by the gradients of the constraints:
∂f_0/∂x = − Σ_i α_i ∂f_i/∂x with α_i ≥ 0 for all i. (53)
Note that if a certain constraint f_i does not participate in the boundary of D at x*, i.e. if the constraint is not active, the coefficient α_i should be 0. Equation (53) can be rewritten as
∂/∂x [ f_0(x) + Σ_i α_i f_i(x) ] = 0 for some α_i ≥ 0, i = 1, ..., m (54)
where the bracketed expression is L(x, α). The optimum x* has to satisfy the equation above. The new function L(x, α) is the Lagrangean of the problem and the variables α_i are called Lagrange multipliers. The Lagrangean is convex in x and affine (i.e. linear + constant) in α.

31 The dual problem
Define the function
g(α) = inf_x L(x, α), α = (α_i), α_i ≥ 0. (55)
In the above, the infimum is over all the values of x for which f_0, f_i are defined, not just A (but everything still holds if the infimum is only taken over A). Two facts are important about g:
g(α) ≤ L(x, α) ≤ f_0(x) for any x ∈ A, α ≥ 0, i.e. g is a lower bound for f_0, and implicitly for the optimal value f_0(x*), for any value of α ≥ 0.
g(α) is concave (i.e. −g(α) is convex).
We can also derive from (54) that if x* exists then for an appropriate value α* we have
g(α*) = L(x*, α*) = f_0(x*) + 0 (56)
and therefore g(α*) must be the unique maximum of g(α). The second term in L above is zero because x* is on the boundary of A; hence for the active constraints f_i(x*) = 0 and for the inactive constraints α_i* = 0.

32 This surprising relationship shows that by solving the dual problem
max g(α) s.t. α ≥ 0 (57)
we can obtain the values α* that, plugged into (53), will allow us to find the solution x* to our original (primal) problem. The constraints of the dual are simpler than the constraints of the primal. In practice, it is surprisingly often possible to compute the function g(α) explicitly. Below we give a simple example thereof. This is also the case of the SVM optimization problem, which will be discussed in section 5.

33 A simple optimization example
Take as an example the convex optimization problem
min (1/2) x^2 s.t. x ≤ −1. (58)
By inspection the solution is x* = −1. Let us now apply to it the convex optimization machinery. We have
L(x, α) = (1/2) x^2 + α (x + 1) (59)
defined for x ∈ R and α ≥ 0.
g(α) = inf_x [ (1/2) x^2 + α (x + 1) ] (60)
     = inf_x [ (1/2) (x + α)^2 − (1/2) α^2 + α ] (61)
     = − (1/2) α^2 + α (62)
     = (1/2) α (2 − α), attained for x = −α. (63)
The dual problem is
max (1/2) α (2 − α) s.t. α ≥ 0 (64)
and its solution is α* = 1 which, using equation (63), leads to x* = −1. From the KKT condition
∂L/∂x = x + α = 0 (65)
we also obtain x* = −α* = −1.

34 Figure 2 depicts the function L. Note that L is convex in x (a parabola) and that along the α axis the graph of L consists of lines. The areas of L that fall outside the admissible domain x ≤ −1, α ≥ 0 are in flat (green) color. The cross-section L(x, α = 0) represents the plot of f. The constrained minimum of f is at x = −1; the unconstrained one is at x = 0, outside the admissible domain. Note that g(α) = L(−α, α) is concave, and that in the admissible domain it is always below the graph of f. The (red) dot is the optimum (x*, α*), which represents a saddle point for L. The line L(x = −1, α) is horizontal (because f_1 = x + 1 = 0) and thus L(x*, α*) = L(x*, ·) = f(x*).
Figure: The surface L(x, α) for the problem min (1/2) x^2 s.t. x ≤ −1.

35 The SVM solution by convex optimization
The SVM optimization problem
min_{w,b} (1/2) ||w||^2 s.t. y_i (w^T x_i + b) ≥ 1 for all i (66)
is a convex (quadratic) optimization problem where
f_0(w, b) = (1/2) ||w||^2 (67)
f_i(w, b) = −y_i w^T x_i + 1 − y_i b. (68)
Hence,
L(w, b, α) = (1/2) ||w||^2 + Σ_i α_i [ 1 − y_i b − y_i x_i^T w ]. (69)
Equating the partial derivatives of L w.r.t. w, b with 0 we get
∂L/∂w = w − Σ_i α_i y_i x_i (70)
∂L/∂b = − Σ_i α_i y_i (71)
or, equivalently,
w = Σ_i α_i y_i x_i, 0 = Σ_i α_i y_i. (72)
Hence, the normal w to the optimal separating hyperplane is a linear combination of data points.

36 Sparsity of the solution
Moreover, we know that only those α_i corresponding to active constraints will be non-zero. In the case of SVM, these represent points that are classified with y_i (w^T x_i + b) = 1. We call these points support points or support vectors. The solution of the SVM problem does not depend on all the data points; it depends only on the support vectors and is therefore sparse.
Computing the solution: SVM solvers use the dual problem to compute the solution. Below we derive the dual for the SVM problem. g(α) is computed explicitly by replacing equation (72) in (69). After a simple calculation we obtain
g(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N y_i y_j x_i^T x_j α_i α_j (73)
or, in vector/matrix notation,
g(α) = 1^T α − (1/2) α^T Ḡ α (74)
where Ḡ = [Ḡ_{ij}]_{ij} = [y_i y_j x_i^T x_j]_{ij}.

37 A simple SVM problem
Data: 4 vectors in the plane and their labels:
x_1 = (−2, −2), y_1 = +1; x_2 = (−1, 1), y_2 = +1; x_3 = (1, 1), y_3 = −1; x_4 = (2, −2), y_4 = −1.
The Gram matrix G = [x_i^T x_j]_{i,j=1:4} is
G =
[  8   0  −4   0 ]
[  0   2   0  −4 ]
[ −4   0   2   0 ]
[  0  −4   0   8 ]
The dual function to be maximized (subject to α_i ≥ 0) is
g(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
     = α_1 + α_2 + α_3 + α_4 − 4α_1^2 − α_2^2 − α_3^2 − 4α_4^2 − 4α_1 α_3 − 4α_2 α_4
     = (2α_1 + α_3) − (2α_1 + α_3)^2 − α_1 + (α_2 + 2α_4) − (α_2 + 2α_4)^2 − α_4.
The parts depending on α_1, α_3 and on α_2, α_4 can be maximized separately.

38 After some short calculations we obtain:
α_1 = 0, α_4 = 0, α_2 = 1/2, α_3 = 1/2.
Hence, the support vectors are x_2 and x_3. From these, we obtain
w = Σ_i α_i y_i x_i = (1/2) (x_2 − x_3) = (−1, 0)
b = y_2 − w^T x_2 = 0.
The results are depicted in the figure below.
Figure: the four data points (x_1, x_2 in class +, x_3, x_4 in class −), the margin, the normal vector w, and the separating hyperplane.
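A numpy check of this worked example (the point coordinates below follow the reconstruction above; the signs are an assumption of this transcription): plugging in the stated α recovers w and b via (72), and x_2, x_3 indeed sit exactly on the margin.

import numpy as np

# Coordinates as reconstructed above (the signs are an assumption).
X = np.array([[-2.0, -2.0], [-1.0, 1.0], [1.0, 1.0], [2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0, 0.5, 0.5, 0.0])

print("Gram matrix G:\n", X @ X.T)
print("sum_i alpha_i y_i =", alpha @ y)               # 0, as the dual requires
w = (alpha * y) @ X                                   # w = sum_i alpha_i y_i x_i = (-1, 0)
b = y[1] - X[1] @ w                                   # b from support vector x_2
print("w =", w, " b =", b)
print("margins y_i (w^T x_i + b):", y * (X @ w + b))  # x_2, x_3 on the margin (= 1)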
