Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012


1 Support Vector Machines. Jie Tang, Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University

2 Outline. What is a Support Vector Machine? Solving SVMs. Kernel Tricks.

3 What is a Support Vector Machine? SVM is related to statistical learning theory [3]. SVM was first introduced in 1992 [1]. SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet-4 (see Section 5.11 in [2] or the discussion in [3] for details). SVM is now regarded as an important example of kernel methods, one of the key areas in machine learning.
[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 2000.

4 Classification Problem. Given a training set S = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, with x_i ∈ X = R^m, i = 1, 2, ..., N, we want to learn a function g(x) such that the decision function f(x) = sgn(g(x)) can classify a new input x. So this is a supervised batch learning method. Linear classifier:
g(x) = w^T x + b
sgn(g(x)) = +1 if g(x) ≥ 0; -1 if g(x) < 0
f(x) = sgn(g(x))
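To make the decision rule concrete, here is a minimal NumPy sketch of the linear classifier above; the weight vector w and bias b are made-up values for illustration, not learned ones.

```python
import numpy as np

def f(x, w, b):
    """Decision function f(x) = sgn(g(x)) with g(x) = w^T x + b; sgn(0) maps to +1."""
    g = np.dot(w, x) + b
    return 1 if g >= 0 else -1

# Hypothetical parameters and a test point
w = np.array([2.0, -1.0])
b = 0.5
print(f(np.array([1.0, 1.0]), w, b))  # g = 1.5, so the prediction is +1
```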

5 What is a good Decision Boundary? Consider a two-class, linearly separable classification problem. Many decision boundaries! The Perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed. Are all decision boundaries equally good? [Figure: linearly separable points of Class +1 and Class -1 with several candidate boundaries]

6 Geometric Interpretation

7 Affine Set. Line through x_1 and x_2: all points x = θx_1 + (1-θ)x_2, θ ∈ R. Affine set: contains the line through any two distinct points in the set. Affine function: f: R^n -> R^m is affine if f(x) = Ax + b with A ∈ R^{m×n}, b ∈ R^m. (A number of the following pages are from Boyd's slides.)

8 Convex Set. Line segment between x_1 and x_2: all points x = θx_1 + (1-θ)x_2, 0 ≤ θ ≤ 1. Convex set: contains the line segment between any two points in the set: x_1, x_2 ∈ C, 0 ≤ θ ≤ 1 => θx_1 + (1-θ)x_2 ∈ C. [Figure: examples, one convex and two nonconvex sets]

9 Hyperplanes and Halfspaces. Hyperplane: set of the form {x | a^T x = b} (a ≠ 0). Halfspace: set of the form {x | a^T x ≤ b} (a ≠ 0). a is the normal vector. Hyperplanes are affine and convex; halfspaces are convex.

10 Bisector-based Decision Boundary. The convex hull of a set S = {x_1, ..., x_k} is
conv(S) = { x = Σ_{j=1}^k λ_j x_j | Σ_{j=1}^k λ_j = 1, λ_j ≥ 0, j = 1, ..., k }
[Figure: closest points c ∈ conv(Class +1) and d ∈ conv(Class -1), with margin m]

11 Formalization.
min_β (1/2) ||c - d||² = min_β (1/2) || Σ_{y_i=1} β_i x_i - Σ_{y_j=-1} β_j x_j ||²
s.t. Σ_{y_i=1} β_i = 1, Σ_{y_j=-1} β_j = 1, 0 ≤ β_i ≤ 1, i ∈ [1, m]
The objective is to solve for all the β_i. Then we can obtain the two points having the closest distance by
c = Σ_{y_i=1} β_i x_i, d = Σ_{y_j=-1} β_j x_j
Next we compute the hyperplane w^T x + b = 0 by
w = c - d = Σ_{i=1}^m β_i y_i x_i, b = -(1/2) (c - d)·(c + d)
Finally, we make the prediction by f(x) = sgn(w^T x + b).
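A small NumPy sketch of this construction, assuming the β weights solving the closest-points problem are already known (the data and weights below are made up; in practice the β come from a QP solver):

```python
import numpy as np

def bisector_hyperplane(X_pos, X_neg, beta_pos, beta_neg):
    """Build (w, b) from convex-combination weights beta over each class."""
    c = beta_pos @ X_pos             # closest point in conv(Class +1)
    d = beta_neg @ X_neg             # closest point in conv(Class -1)
    w = c - d                        # normal of the bisecting hyperplane
    b = -0.5 * np.dot(c - d, c + d)  # plane passes through the midpoint (c+d)/2
    return w, b

# Made-up data; each beta vector sums to 1 with entries in [0, 1]
X_pos = np.array([[2.0, 2.0], [3.0, 1.0]])
X_neg = np.array([[0.0, 0.0], [-1.0, 1.0]])
w, b = bisector_hyperplane(X_pos, X_neg, np.array([0.5, 0.5]), np.array([1.0, 0.0]))
print(np.sign(w @ np.array([2.0, 1.5]) + b))  # 1.0: predicted class +1
```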

12 Maximal Margin

13 Large-margin Decision Boundary. The decision boundary should be as far away from the data of both classes as possible: we should maximize the margin m. The distance between the origin and the line w^T x = -b is -b/||w||. The margin is m = 2γ/||w||, between the two hyperplanes w^T x + b = γ and w^T x + b = -γ. [Figure: Class -1 and Class +1 separated by the margin m]

14 Formalization.
max_{γ,w,b} 2γ/||w||, equal to min_{γ,w,b} ||w||/(2γ)
Note: we have constraints
s.t. w^T x^{(i)} + b ≥ γ, 1 ≤ i ≤ k, y_i = +1
w^T x^{(j)} + b ≤ -γ, k < j ≤ N, y_j = -1
equal to y^{(i)}(w^T x^{(i)} + b) ≥ γ, 1 ≤ i ≤ N
Since we can arbitrarily scale w and b without changing anything, we introduce the scaling constraint γ = 1:
min_{w,b} ||w||/2, s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1, 1 ≤ i ≤ N
Changing to the 2-norm ||w||²/2 gives the loss function. This is a constrained optimization problem.

15 Loss Function. Then we arrive at the loss function
min_{w,b} (1/2)||w||², s.t. y^{(i)}(b + w^T x^{(i)}) ≥ 1
Another popular loss function: hinge loss + penalty
min_{w,b} Σ_{i=1}^N [1 - y^{(i)}(b + w^T x^{(i)})]_+ + (λ/2)||w||²

16 Loss Function (cont.) Empirical loss function:
min_{w,b} Σ_{i=1}^N [1 - y^{(i)}(b + w^T x^{(i)})]_+
Structural loss function:
min_{w,b} Σ_{i=1}^N [1 - y^{(i)}(b + w^T x^{(i)})]_+ + (λ/2)||w||²
where ||w||² indicates the complexity of the model; it is also called the penalty. There are many kinds of formulations of the loss function.
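A minimal NumPy sketch of the structural loss above, with made-up data and parameters:

```python
import numpy as np

def structural_loss(w, b, X, y, lam):
    """sum_i [1 - y_i (b + w^T x_i)]_+ plus the penalty (lam/2) ||w||^2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # empirical (hinge) loss
    penalty = 0.5 * lam * np.dot(w, w)            # model-complexity penalty
    return hinge + penalty

# Toy data: both points are outside the margin, so only the penalty remains
X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1])
print(structural_loss(np.array([1.0, 1.0]), 0.0, X, y, lam=0.1))  # 0.1
```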

17 Optimal Margin Classifiers. For the problem
min_{w,b} ||w||²/2, s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1, 1 ≤ i ≤ N
we can write the Lagrangian form
L(w, b, α) = ||w||²/2 - Σ_{i=1}^N α_i [y^{(i)}(w^T x^{(i)} + b) - 1], s.t. α_i ≥ 0, 1 ≤ i ≤ N
WHY? Let us review the generalized Lagrangian.

18 Review: Convex Optimization and Lagrange Duality

19 Convex Function. f: R^n -> R is convex if dom(f) is a convex set and
f(θx + (1-θ)y) ≤ θf(x) + (1-θ)f(y) for all x, y ∈ dom(f), 0 ≤ θ ≤ 1
f is concave if -f is convex. f is strictly convex if dom(f) is convex and
f(θx + (1-θ)y) < θf(x) + (1-θ)f(y) for all x, y ∈ dom(f), x ≠ y, 0 < θ < 1

20 First-order Condition. f is differentiable if dom(f) is open and the gradient
∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)
exists at each x ∈ dom(f). 1st-order condition: a differentiable f with convex domain is convex iff
f(y) ≥ f(x) + ∇f(x)^T (y - x) for all x, y ∈ dom(f)

21 Second-order Condition. f is twice differentiable if dom(f) is open and the Hessian
∇²f(x)_{ij} = ∂²f(x)/∂x_i ∂x_j, i, j = 1, ..., n
exists at each x ∈ dom(f). 2nd-order condition: a twice differentiable f with convex domain is convex iff
∇²f(x) ⪰ 0 for all x ∈ dom(f)
If ∇²f(x) ≻ 0 for all x ∈ dom(f), then f is strictly convex.

22 Convex Optimization Problem. Standard form convex optimization problem:
min f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., k
a_i^T x - b_i = 0, i = 1, ..., l
where f_0, f_1, ..., f_k are convex and the equality constraints are affine. Important property: the feasible set of a convex optimization problem is convex. Example:
min f_0(x) = x_1² + x_2²
s.t. f_1(x) = x_1/(1 + x_2²) ≤ 0
h_1(x) = (x_1 + x_2)² = 0
f_0 is convex and the feasible set {(x_1, x_2) | x_1 = -x_2 ≤ 0} is convex, yet this is not a convex problem in standard form, since f_1 is not convex and h_1 is not affine.

23 Lagrange Duality. When solving optimization problems with constraints, Lagrange duality is often used to obtain the solution of the primal problem through solving the dual problem. Primal optimization problem: if f(x), g_i(x), h_j(x) are continuously differentiable functions defined on R^n, then the following optimization problem is called the primal optimization problem:
min_{x ∈ R^n} f(x)
s.t. g_i(x) ≥ 0, i = 1, ..., k
h_j(x) = 0, j = 1, ..., l

24 Primal Optimization Problem. To solve the primal optimization problem, we define the generalized Lagrangian
L(x, α, β) = f(x) - Σ_{i=1}^k α_i g_i(x) - Σ_{j=1}^l β_j h_j(x), s.t. α_i ≥ 0
where the α_i and β_j are Lagrange multipliers. Consider the function
θ_P(x) = max_{α,β: α_i ≥ 0} L(x, α, β) = max_{α,β: α_i ≥ 0} [ f(x) - Σ_{i=1}^k α_i g_i(x) - Σ_{j=1}^l β_j h_j(x) ]
Here P stands for primal. Assume some x violates any of the primal constraints (i.e., either g_i(x) < 0 or h_j(x) ≠ 0 for some i, j); then we can verify that θ_P(x) = +∞: if g_i(x) < 0 for some i, we can set α_i to +∞; if h_j(x) ≠ 0 for some j, we can choose β_j so that -β_j h_j(x) -> +∞; and set the other α_i and β_j to 0. In contrast, if the constraints are indeed satisfied for a particular value of x, then θ_P(x) = f(x). Therefore:
θ_P(x) = f(x) if x satisfies the primal constraints; +∞ otherwise
If we consider the minimization problem
min_x θ_P(x) = min_x max_{α,β: α_i ≥ 0} L(x, α, β)
we see that the primal problem is represented by the min-max problem of the generalized Lagrangian, i.e., p* = min_x θ_P(x).

25 Dual Optimization Problem. Define θ_D(α, β) = min_x L(x, α, β). The dual optimization problem is
max_{α,β: α_i ≥ 0} θ_D(α, β) = max_{α,β: α_i ≥ 0} min_x L(x, α, β)
This is exactly the same as the primal problem, except that the order of the max and the min is now exchanged. We also define the optimal value of the dual problem's objective to be d* = max_{α,β: α_i ≥ 0} θ_D(α, β). How are the primal and the dual problems related? It can easily be shown that
d* = max_{α,β: α_i ≥ 0} min_x L(x, α, β) ≤ min_x max_{α,β: α_i ≥ 0} L(x, α, β) = p*
Proof: θ_D(α, β) = min_x L(x, α, β) ≤ L(x, α, β) ≤ max_{α,β: α_i ≥ 0} L(x, α, β) = θ_P(x), so θ_D(α, β) ≤ θ_P(x) for every α, β, x. Because the primal and dual problems both have optimal values,
max_{α,β: α_i ≥ 0} θ_D(α, β) ≤ min_x θ_P(x), i.e., d* ≤ p*

26 KKT Conditions. Under certain conditions we will have d* = p*, so that we can solve the dual problem in lieu of the primal problem. What are those conditions? Suppose (1) f and the g_i are convex, and the h_j are affine; (2) the constraints g_i are (strictly) feasible: there exists some x so that g_i(x) > 0 for all i. Under the above assumptions, there must exist x*, α*, β* so that x* is the solution to the primal problem, α*, β* are the solution to the dual problem, and p* = d* = L(x*, α*, β*). The necessary and sufficient conditions are the KKT (Karush-Kuhn-Tucker) conditions:
∂L(x*, α*, β*)/∂x_i = 0, i ∈ [1, n]
∂L(x*, α*, β*)/∂α_i = 0, i ∈ [1, k]
∂L(x*, α*, β*)/∂β_j = 0, j ∈ [1, l]
α_i* g_i(x*) = 0, i ∈ [1, k]   (KKT dual complementarity condition: if α_i* > 0, then g_i(x*) = 0)
g_i(x*) ≥ 0, i ∈ [1, k]
α_i* ≥ 0, i ∈ [1, k]

27 Back to Our Optimal Margin Classifiers

28 Optimal Margin Classifiers. For the problem
min_{w,b} ||w||²/2, s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1, 1 ≤ i ≤ N
we can write the Lagrangian form
L(w, b, α) = ||w||²/2 - Σ_{i=1}^N α_i [y^{(i)}(w^T x^{(i)} + b) - 1], s.t. α_i ≥ 0, 1 ≤ i ≤ N
Then our problem becomes
min_{w,b} max_α L(w, b, α)
If certain conditions are satisfied, then we instead have
max_α min_{w,b} L(w, b, α)

29 Solve the Dual Problem.
max_α min_{w,b} L(w, b, α) = ||w||²/2 - Σ_{i=1}^N α_i [y^{(i)}(w^T x^{(i)} + b) - 1], s.t. α_i ≥ 0, 1 ≤ i ≤ N
Let us first solve the inside minimization problem by setting the gradient of L(w, b, α) w.r.t. w and b to zero. We have
∂L(w, b, α)/∂w = w - Σ_{i=1}^N α_i y^{(i)} x^{(i)} = 0 => w = Σ_{i=1}^N α_i y^{(i)} x^{(i)}
∂L(w, b, α)/∂b = -Σ_{i=1}^N α_i y^{(i)} = 0
Then let us substitute the two equations into L(w, b, α) to solve the outer maximization problem.

30 Solve the Dual Problem. Now we have w = Σ_{i=1}^N α_i y^{(i)} x^{(i)} and Σ_{i=1}^N α_i y^{(i)} = 0. Substituting w back into L(w, b, α):
L(b, α) = (1/2) (Σ_{i=1}^N α_i y^{(i)} x^{(i)})·(Σ_{j=1}^N α_j y^{(j)} x^{(j)}) - Σ_{i=1}^N α_i [y^{(i)}(Σ_{j=1}^N α_j y^{(j)} x^{(j)}·x^{(i)} + b) - 1]
= (1/2) Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)} - Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)} - b Σ_{i=1}^N α_i y^{(i)} + Σ_{i=1}^N α_i
= -(1/2) Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)} - b Σ_{i=1}^N α_i y^{(i)} + Σ_{i=1}^N α_i
Because Σ_{i=1}^N α_i y^{(i)} = 0, we obtain
L(α) = Σ_{i=1}^N α_i - (1/2) Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)}
The new objective function is a function of α only. It is known as the dual problem: if we know w, we know all α; vice versa.
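The dual objective is easy to evaluate numerically; a sketch with made-up, feasible α (α_i ≥ 0 and Σ_i α_i y_i = 0):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """L(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j <x_i, x_j>."""
    K = X @ X.T                           # Gram matrix of inner products
    Q = (y[:, None] * y[None, :]) * K     # Q_ij = y_i y_j <x_i, x_j>
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
alpha = np.array([0.5, 0.5])              # feasible: alpha >= 0, sum alpha_i y_i = 0
print(dual_objective(alpha, X, y))        # 0.5
```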

31 The Dual Problem (cont.) The original problem, also known as the primal problem:
min_{w,b} ||w||²/2, s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1, 1 ≤ i ≤ N
equivalently
min_{w,b} max_α L(w, b, α) = ||w||²/2 - Σ_{i=1}^N α_i [y^{(i)}(w^T x^{(i)} + b) - 1], s.t. α_i ≥ 0, 1 ≤ i ≤ N
The dual problem max_α min_{w,b} L(w, b, α):
max_α Σ_{i=1}^N α_i - (1/2) Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)}
s.t. α_i ≥ 0, 1 ≤ i ≤ N; Σ_{i=1}^N α_i y^{(i)} = 0
The constraints α_i ≥ 0 are the properties of α when we introduce the Lagrange multipliers; the constraint Σ_i α_i y^{(i)} = 0 is the result when we differentiate the original Lagrangian w.r.t. b.

32 Relationship between Primal and Dual Problems.
d* = max_{α,β: α_i ≥ 0} min_x L(x, α, β) ≤ min_x max_{α,β: α_i ≥ 0} L(x, α, β) = p*
Note: if under some conditions d* = p*, we can solve the dual problem in lieu of the primal problem. What are the conditions? The famous KKT conditions (Karush-Kuhn-Tucker conditions). In our case:
∂L(w*, b*, α*)/∂w = w* - Σ_{i=1}^N α_i* y^{(i)} x^{(i)} = 0 => w* = Σ_{i=1}^N α_i* y^{(i)} x^{(i)}   (1)
∂L(w*, b*, α*)/∂b = -Σ_{i=1}^N α_i* y^{(i)} = 0   (2)
α_i* (y^{(i)}(w*·x^{(i)} + b*) - 1) = 0, i ∈ [1, N]   (3)
y^{(i)}(w*·x^{(i)} + b*) - 1 ≥ 0, i ∈ [1, N]
α_i* ≥ 0, i ∈ [1, N]
What is KKT? For the Lagrangian formula
L(x, α, β) = f(x) - Σ_{i=1}^k α_i g_i(x) - Σ_{j=1}^l β_j h_j(x), s.t. α_i ≥ 0
the KKT conditions are:
∂L(x*, α*, β*)/∂x_i = 0, i ∈ [1, n]
∂L(x*, α*, β*)/∂α_i = 0, i ∈ [1, k]
∂L(x*, α*, β*)/∂β_j = 0, j ∈ [1, l]
α_i* g_i(x*) = 0, i ∈ [1, k]
g_i(x*) ≥ 0, i ∈ [1, k]
α_i* ≥ 0, i ∈ [1, k]

33 Now We Have. Then what we have is the maximization problem with respect to α:
max_α Σ_{i=1}^N α_i - (1/2) Σ_{i,j} α_i α_j y^{(i)} y^{(j)} x^{(i)}·x^{(j)}
s.t. α_i ≥ 0, 1 ≤ i ≤ N; Σ_{i=1}^N α_i y^{(i)} = 0
This is a quadratic programming (QP) problem, so a global maximum over α can always be found. Then solve for w by
w = Σ_{i=1}^N α_i y^{(i)} x^{(i)}
Finally solve for b: there is at least one α_j* > 0 (if all α_j* = 0, from equation (1) we know that w* = 0; however, w* = 0 is not the optimal solution). From equation (3), for such j we know that
y^{(j)}(w*·x^{(j)} + b*) - 1 = 0
Because y^{(j)} y^{(j)} = 1,
b* = y^{(j)} - w*·x^{(j)} = y^{(j)} - Σ_{i=1}^N α_i* y^{(i)} x^{(i)}·x^{(j)}
Characteristics of the solution: many of the α_i are zero, so w is a linear combination of a small number of data points. This sparse representation can be viewed as data compression, as in the construction of the kNN classifier. The x^{(i)} with non-zero α_i are called support vectors (SVs). The decision boundary is determined only by the SVs.
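Given a dual solution α (from any QP solver), recovering w and b follows the two formulas above; a sketch reusing the toy problem from the previous block, with a made-up dual solution:

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """w = sum_i alpha_i y_i x_i; b = y_j - w . x_j for any support vector j."""
    w = (alpha * y) @ X
    sv = np.where(alpha > tol)[0]   # indices of the support vectors
    j = sv[0]                       # any support vector works
    b = y[j] - X[j] @ w
    return w, b, sv

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
alpha = np.array([0.5, 0.5])        # made-up dual solution for illustration
w, b, sv = recover_w_b(alpha, X, y)
print(w, b, sv)                     # [1. 0.] 0.0 [0 1]
```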

34 A Geometrical Interpretation. [Figure: Class -1 and Class +1 points; only the points on the margin have non-zero multipliers, e.g., α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, while α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0]

35 How to Predict. For a new sample x, we can predict it by
w^T x + b = (Σ_{i=1}^N α_i y^{(i)} x^{(i)})^T x + b = Σ_{i=1}^N α_i y^{(i)} ⟨x^{(i)}, x⟩ + b
Classify x as class +1 if the sum is positive, and class -1 otherwise. Note: w need not be formed explicitly.
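A sketch of this support-vector-only prediction; the kernel argument defaults to the plain inner product, so w is never formed (the support vectors and multipliers below are assumed given, not trained):

```python
import numpy as np

def predict(x, sv_x, sv_y, sv_alpha, b, kernel=np.dot):
    """sign( sum_i alpha_i y_i K(x_i, x) + b ) over the support vectors only."""
    s = sum(a * t * kernel(xi, x) for a, t, xi in zip(sv_alpha, sv_y, sv_x)) + b
    return 1 if s > 0 else -1

# Assumed support set from a trained model (made-up numbers)
sv_x = np.array([[1.0, 0.0], [-1.0, 0.0]])
sv_y = np.array([1, -1])
sv_alpha = np.array([0.5, 0.5])
print(predict(np.array([0.3, 2.0]), sv_x, sv_y, sv_alpha, b=0.0))  # +1
```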

36 Non-separable. What is the non-separable case? We allow an error ξ_i in classification; it is based on the output of the discriminant function w^T x + b, and Σ_i ξ_i approximates the number of misclassified samples. [Figure: overlapping Class -1 and Class +1 points]

37 Non-linear Cases. What is the non-linear case?

38 Non-separable Case. The formalization of the optimization problem becomes:
min_{w,b,ξ} ||w||²/2 + C Σ_{i=1}^N ξ_i
s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1 - ξ_i, 1 ≤ i ≤ N
ξ_i ≥ 0, 1 ≤ i ≤ N
Thus, examples are now permitted to have margin less than 1, and if an example has functional margin 1 - ξ_i (with ξ_i > 0), we pay a cost of Cξ_i in the objective function. The parameter C controls the relative weighting between the twin goals of making ||w||² small and of ensuring that most examples have functional margin at least 1.
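In practice one rarely solves this QP by hand; a short sketch using scikit-learn's SVC (an assumption here, not part of the lecture) illustrates how C trades margin width against slack:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny toy set with two points per class
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9]])
y = np.array([-1, -1, 1, 1])

# Small C tolerates slack (wider margin); large C penalizes violations heavily
for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.predict([[0.2, 0.3]]))
```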

39 Lagrangian Solution. Again, we have the Lagrangian form:
L(w, b, ξ, α, σ) = ||w||²/2 + C Σ_{i=1}^N ξ_i - Σ_{i=1}^N α_i [y^{(i)}(w^T x^{(i)} + b) - 1 + ξ_i] - Σ_{i=1}^N σ_i ξ_i, s.t. σ_i ≥ 0, α_i ≥ 0
The resulting dual problem:
max_α L(α) = Σ_{i=1}^N α_i - (1/2) Σ_{i,j} y^{(i)} y^{(j)} α_i α_j ⟨x^{(i)}, x^{(j)}⟩
s.t. C ≥ α_i ≥ 0, i ∈ [1, N]; Σ_{i=1}^N α_i y^{(i)} = 0
What is the difference from the separable form? Only that the α_i are now upper-bounded by C. KKT conditions:
∂L(w, b, α)/∂w = 0; ∂L(w, b, α)/∂ξ_i = 0, i ∈ [1, N]; ∂L(w, b, α)/∂b = 0
α_i (y^{(i)}(w^T x^{(i)} + b) - 1 + ξ_i) = 0, i ∈ [1, N]
y^{(i)}(w^T x^{(i)} + b) - 1 + ξ_i ≥ 0, i ∈ [1, N]
α_i ≥ 0, σ_i ≥ 0, i ∈ [1, N]
which imply:
α_i = 0 => y^{(i)}(w^T x^{(i)} + b) ≥ 1
α_i = C => y^{(i)}(w^T x^{(i)} + b) ≤ 1
0 < α_i < C => y^{(i)}(w^T x^{(i)} + b) = 1

40 How to Train the SVM. Solving the quadratic programming optimization problem directly to train the SVM is very slow when the training data grows large. The sequential minimal optimization (SMO) algorithm is due to John Platt. First, let us introduce the coordinate ascent algorithm:
Loop until convergence: {
  For i = 1, ..., m {
    α_i := argmax_{α_i} L(α_1, ..., α_{i-1}, α_i, α_{i+1}, ..., α_m)
  }
}
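A generic sketch of coordinate ascent on an unconstrained concave quadratic (not the SVM dual itself, whose equality constraint is the issue raised on the next slide); each inner step sets ∂L/∂α_i = 0 exactly:

```python
import numpy as np

def coordinate_ascent(Q, p, alpha, n_iters=50):
    """Maximize L(a) = p^T a - 1/2 a^T Q a (Q positive definite), one coordinate at a time."""
    for _ in range(n_iters):
        for i in range(len(alpha)):
            # dL/da_i = p_i - sum_j Q_ij a_j = 0, solved for a_i with the others held fixed
            rest = Q[i] @ alpha - Q[i, i] * alpha[i]
            alpha[i] = (p[i] - rest) / Q[i, i]
    return alpha

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, 1.0])
print(coordinate_ascent(Q, p, np.zeros(2)))  # approaches the maximizer Q^{-1} p
```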

41 Coordinate ascent is OK? Is it sufficient? No: the equality constraint ties the α_i together. Since Σ_{i=1}^N α_i y^{(i)} = 0, we have
α_1 y^{(1)} = -Σ_{i=2}^N α_i y^{(i)}, i.e., α_1 = -y^{(1)} Σ_{i=2}^N α_i y^{(i)}
so α_1 is exactly determined by the other α_i and cannot be updated on its own.

42 SMO. Change the algorithm as follows; this is just SMO:
Repeat until convergence {
  1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Reoptimize L(α) with respect to α_i and α_j, while holding all the other α_k fixed.
}
Many other approaches have been proposed, e.g., LOQO, CPLEX, etc. Holding α_3, ..., α_N fixed, the equality constraint gives
α_1 y^{(1)} + α_2 y^{(2)} = -Σ_{i=3}^N α_i y^{(i)} = ς
α_1 = (ς - α_2 y^{(2)}) y^{(1)}
L(α) = L((ς - α_2 y^{(2)}) y^{(1)}, α_2, ..., α_N)

43 SMO (2).
L(α) = L((ς - α_2 y^{(2)}) y^{(1)}, α_2, ..., α_m)
This is a quadratic function in α_2, i.e., for some constants a, b, c it can be written as
a α_2² + b α_2 + c

44 Solving for α_2. For the quadratic function a α_2² + b α_2 + c, we can simply solve it by setting its derivative to zero. Let us call the resulting value α_2^{new,unclipped}. Since the constraints confine α_2 to an interval [L, H], we clip:
α_2^{new} = H if α_2^{new,unclipped} > H
α_2^{new} = α_2^{new,unclipped} if L ≤ α_2^{new,unclipped} ≤ H
α_2^{new} = L if α_2^{new,unclipped} < L
Having found α_2^{new}, we can go back and find the optimal α_1^{new}. Please read Platt's paper if you want more details.
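A small sketch of the clipping step and the paired α_1 update (keeping α_1 y^{(1)} + α_2 y^{(2)} constant); variable names and numbers are illustrative:

```python
def clip_alpha2(a2_unclipped, L, H):
    """Project the unconstrained 1-D optimum back into the box [L, H]."""
    if a2_unclipped > H:
        return H
    if a2_unclipped < L:
        return L
    return a2_unclipped

def update_alpha1(a1, a2_old, a2_new, y1, y2):
    """Keep a1*y1 + a2*y2 fixed: a1_new = a1 + y1*y2*(a2_old - a2_new)."""
    return a1 + y1 * y2 * (a2_old - a2_new)

a2_new = clip_alpha2(1.7, L=0.0, H=1.0)                    # clipped to 1.0
print(a2_new, update_alpha1(0.4, 1.5, a2_new, y1=1, y2=1))  # 1.0 0.9
```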

45 Thanks! HP: Email:
