Support Vector Machine I


1. Support Vector Machine I
Statistical Data Analysis with Positive Definite Kernels
Kenji Fukumizu
Institute of Statistical Mathematics, ROIS
Department of Statistical Science, Graduate University for Advanced Studies
October 6-10, 2008, Kyushu University

2. Outline
- A quick course on convex optimization
  - Convexity and convex optimization
  - Dual problem for optimization
- Dual problem and support vectors
- Sequential Minimal Optimization (SMO)
- Other approaches

3. A quick course on convex optimization: Convexity and convex optimization

4. Convexity I
For the details on convex optimization, see [BV04].
Convex set: a set $C$ in a vector space is convex if for every $x, y \in C$ and $t \in [0, 1]$,
$$tx + (1-t)y \in C.$$
Convex function: let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a convex function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y).$$
Concave function: let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a concave function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1-t)y) \ge t f(x) + (1-t) f(y).$$

5. Convexity II
[Figure: examples of a convex set, a non-convex set, a convex function, and a concave function.]

6. Convexity III
Fact: if $f : C \to \mathbb{R}$ is a convex function, the sublevel set
$$\{x \in C \mid f(x) \le \alpha\}$$
is a convex set for every $\alpha \in \mathbb{R}$.
Fact: if $f_t : C \to \mathbb{R}$ ($t \in T$) are convex, then
$$f(x) = \sup_{t \in T} f_t(x)$$
is also convex.

7. Convex optimization I
A general form of convex optimization: $f(x), h_i(x)$ ($1 \le i \le l$) are convex functions on $D \subset \mathbb{R}^n$, and $a_j \in \mathbb{R}^n$, $b_j \in \mathbb{R}$ ($1 \le j \le m$):
$$\min_{x \in D} f(x) \quad \text{subject to} \quad h_i(x) \le 0 \ (1 \le i \le l), \qquad a_j^T x + b_j = 0 \ (1 \le j \le m).$$
$h_i$: inequality constraints; $r_j(x) = a_j^T x + b_j$: linear equality constraints.
Feasible set:
$$F = \{x \in D \mid h_i(x) \le 0 \ (1 \le i \le l), \ r_j(x) = 0 \ (1 \le j \le m)\}.$$
The above optimization problem is called feasible if $F \ne \emptyset$.

8. Convex optimization II
Fact 1: the feasible set is a convex set.
Fact 2: the set of minimizers
$$X_{\mathrm{opt}} = \{x \in F \mid f(x) = \inf\{f(y) \mid y \in F\}\}$$
is convex. In particular, there are no local minima for convex optimization.
Proof. The intersection of convex sets is convex, which gives Fact 1. For Fact 2, let $p^* = \inf_{x \in F} f(x)$. Then
$$X_{\mathrm{opt}} = \{x \in D \mid f(x) \le p^*\} \cap F.$$
Both sets on the right-hand side are convex (a sublevel set of a convex function, and the feasible set), which proves Fact 2.

9. Examples
Linear program (LP):
$$\min c^T x \quad \text{subject to} \quad Ax = b, \ Gx \preceq h,$$
where $Gx \preceq h$ denotes $g_j^T x \le h_j$ for all $j$, with $G = (g_1, \ldots, g_m)^T$. The objective function and the equality and inequality constraints are all linear.
Quadratic program (QP):
$$\min \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to} \quad Ax = b, \ Gx \preceq h,$$
where $P$ is a positive semidefinite matrix. Objective function: quadratic; equality and inequality constraints: linear.
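To make the QP form concrete, here is a minimal sketch using the cvxpy modeling library (my choice of solver interface, not from the slides). The problem data are random placeholders, and $h$ is chosen so that a strictly feasible point exists.

```python
import numpy as np
import cvxpy as cp

# Made-up problem data: P is positive semidefinite by construction.
np.random.seed(0)
n, m, p = 5, 3, 2
M = np.random.randn(n, n)
P = M.T @ M
q = np.random.randn(n)
A = np.random.randn(p, n)
b = np.random.randn(p)
G = np.random.randn(m, n)
x0 = np.linalg.lstsq(A, b, rcond=None)[0]  # a point with A x0 = b
h = G @ x0 + 1.0                           # makes x0 strictly feasible

# QP:  min (1/2) x^T P x + q^T x   s.t.  Ax = b, Gx <= h
x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x)
constraints = [A @ x == b, G @ x <= h]
prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value)
```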

10. A quick course on convex optimization: Dual problem for optimization

11. Lagrange dual
Consider an optimization problem (which may not be convex):
(primal) $\min_{x \in D} f(x)$ subject to $h_i(x) \le 0$ ($1 \le i \le l$), $r_j(x) = 0$ ($1 \le j \le m$).
Lagrange dual function: $g : \mathbb{R}^l \times \mathbb{R}^m \to [-\infty, \infty)$,
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu), \qquad \text{where} \qquad L(x, \lambda, \nu) = f(x) + \sum_{i=1}^l \lambda_i h_i(x) + \sum_{j=1}^m \nu_j r_j(x).$$
$\lambda_i$ and $\nu_j$ are called Lagrange multipliers. $g$ is a concave function (an infimum of affine functions of $(\lambda, \nu)$).
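A one-variable worked example (my addition, not from the slides): take $f(x) = x^2$ with the single constraint $h(x) = 1 - x \le 0$, so $p^* = f(1) = 1$. The Lagrangian is $L(x, \lambda) = x^2 + \lambda(1 - x)$; minimizing over $x$ gives $x = \lambda/2$, so
$$g(\lambda) = \inf_{x \in \mathbb{R}} \{x^2 + \lambda(1 - x)\} = \lambda - \frac{\lambda^2}{4},$$
which is concave in $\lambda$, and $\max_{\lambda \ge 0} g(\lambda) = g(2) = 1 = p^*$.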

12. Geometric interpretation of dual function
$$G = \{(u, v, t) = (h(x), r(x), f(x)) \in \mathbb{R}^l \times \mathbb{R}^m \times \mathbb{R} \mid x \in \mathbb{R}^n\}.$$
Omit $r(x)$ and $v$ for simplicity. For $\lambda \ge 0$,
$$g(\lambda) = \inf_x \{\lambda^T h(x) + f(x)\} = \inf\{t + \lambda^T u \mid (u, t) \in G\}.$$
The hyperplane $t + \lambda^T u = b$ intersects the $t$-axis at $b$. Thus $g(\lambda)$ is the smallest $t$-intercept among all the hyperplanes with fixed normal $(\lambda, 1)$ that intersect $G$.
[Figure: the set $G$ in the $(u, t)$-plane, the supporting hyperplane $\lambda u + t = g(\lambda)$, a higher hyperplane $\lambda u + t = b$, and the primal optimum $p^*$.]

13. Dual problem and weak duality I
(dual) $\max g(\lambda, \nu)$ subject to $\lambda \ge 0$.
The dual and primal problems have a close connection.
Theorem 1 (weak duality). Let
$$p^* = \inf\{f(x) \mid h_i(x) \le 0 \ (1 \le i \le l), \ r_j(x) = 0 \ (1 \le j \le m)\}$$
and
$$d^* = \sup\{g(\lambda, \nu) \mid \lambda \ge 0, \ \nu \in \mathbb{R}^m\}.$$
Then $d^* \le p^*$.
Weak duality does not require convexity of the primal optimization problem.

14. Dual problem and weak duality II
Proof. Let $\lambda \ge 0$ and $\nu \in \mathbb{R}^m$. For any feasible point $x$,
$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^l \lambda_i h_i(x) + \sum_{j=1}^m \nu_j r_j(x) \le f(x),$$
since the second term is non-positive and the third term is zero. Taking the infimum over feasible points,
$$\inf_{x \,\text{feasible}} L(x, \lambda, \nu) \le p^*.$$
Thus
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le \inf_{x \,\text{feasible}} L(x, \lambda, \nu) \le p^*$$
for any $\lambda \ge 0$ and $\nu \in \mathbb{R}^m$.

15. Strong duality I
Some conditions are needed to obtain the strong duality $d^* = p^*$.
Convexity of the problem: $f$ and $h_i$ are convex, $r_j$ are linear.
Slater's condition: there is $\tilde{x} \in \operatorname{relint} D$ such that $h_i(\tilde{x}) < 0$ ($1 \le i \le l$) and $r_j(\tilde{x}) = a_j^T \tilde{x} + b_j = 0$ ($1 \le j \le m$).
Theorem 2 (strong duality). Suppose the primal problem is convex and Slater's condition holds. Then there exist $\lambda^* \ge 0$ and $\nu^* \in \mathbb{R}^m$ such that
$$g(\lambda^*, \nu^*) = d^* = p^*.$$
The proof is omitted (see [BV04]). There are also other conditions that guarantee strong duality.

16. Strong duality II
$$p^* = \inf\{t \mid (u, t) \in G, \ u \le 0\} \quad (v \text{ omitted}),$$
$$g(\lambda) = \inf\{\lambda^T u + t \mid (u, t) \in G\}, \qquad d^* = \sup\{g(\lambda) \mid \lambda \ge 0\}.$$
[Figure: two panels. Left (strong duality): the supporting hyperplane $\lambda^* u + t = g(\lambda^*)$ touches $G$ and $p^* = d^*$. Right (duality gap): every hyperplane $\lambda u + t = g(\lambda)$ stays strictly below $p^*$, so $d^* < p^*$.]

17. Complementary slackness I
Consider the (not necessarily convex) optimization problem:
$$\min f(x) \quad \text{subject to} \quad h_i(x) \le 0 \ (1 \le i \le l), \ r_j(x) = 0 \ (1 \le j \le m).$$
Assumption: the optima of the primal and dual problems are attained at $x^*$ and $(\lambda^*, \nu^*)$ ($\lambda^* \ge 0$), and they satisfy strong duality: $g(\lambda^*, \nu^*) = f(x^*)$.
Observation:
$$f(x^*) = g(\lambda^*, \nu^*) = \inf_{x \in D} L(x, \lambda^*, \nu^*) \quad \text{[definition]}$$
$$\le L(x^*, \lambda^*, \nu^*) = f(x^*) + \sum_{i=1}^l \lambda_i^* h_i(x^*) + \sum_{j=1}^m \nu_j^* r_j(x^*)$$
$$\le f(x^*) \quad \text{[2nd term} \le 0 \text{ and 3rd} = 0\text{]}.$$
Hence the two inequalities are in fact equalities.

18. Complementary slackness II
Consequence 1: $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ (the primal solution is obtained by unconstrained optimization).
Consequence 2: $\lambda_i^* h_i(x^*) = 0$ for all $i$.
The latter is called complementary slackness. Equivalently,
$$\lambda_i^* > 0 \ \Rightarrow \ h_i(x^*) = 0, \qquad h_i(x^*) < 0 \ \Rightarrow \ \lambda_i^* = 0.$$

19. KKT condition I
KKT conditions give useful relations between the primal and dual solutions. Consider the convex optimization problem; assume $D$ is open and $f(x)$, $h_i(x)$ are differentiable:
$$\min f(x) \quad \text{subject to} \quad h_i(x) \le 0 \ (1 \le i \le l), \ r_j(x) = 0 \ (1 \le j \le m).$$
Let $x^*$ and $(\lambda^*, \nu^*)$ be any optimal points of the primal and dual problems, and assume strong duality holds. From Consequence 1 ($x^* = \arg\min_x L(x, \lambda^*, \nu^*)$),
$$\nabla f(x^*) + \sum_{i=1}^l \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^m \nu_j^* \nabla r_j(x^*) = 0.$$

20. KKT condition II
The following are necessary conditions.
Karush-Kuhn-Tucker (KKT) conditions:
$$h_i(x^*) \le 0 \ (i = 1, \ldots, l) \quad \text{[primal constraints]}$$
$$r_j(x^*) = 0 \ (j = 1, \ldots, m) \quad \text{[primal constraints]}$$
$$\lambda_i^* \ge 0 \ (i = 1, \ldots, l) \quad \text{[dual constraints]}$$
$$\lambda_i^* h_i(x^*) = 0 \ (i = 1, \ldots, l) \quad \text{[complementary slackness]}$$
$$\nabla f(x^*) + \sum_{i=1}^l \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^m \nu_j^* \nabla r_j(x^*) = 0.$$
Theorem 3 (KKT condition). For a convex optimization problem with differentiable functions, $x^*$ and $(\lambda^*, \nu^*)$ are the primal-dual solutions with strong duality if and only if they satisfy the KKT conditions.

21. Example
Quadratic minimization under equality constraints:
$$\min \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to} \quad Ax = b,$$
where $P$ is (strictly) positive definite.
KKT conditions:
$$Ax^* = b \quad \text{[primal constraint]},$$
$$\nabla_x L(x^*, \nu^*) = 0 \ \Rightarrow \ P x^* + q + A^T \nu^* = 0.$$
The solution is given by the linear system
$$\begin{pmatrix} P & A^T \\ A & O \end{pmatrix} \begin{pmatrix} x^* \\ \nu^* \end{pmatrix} = \begin{pmatrix} -q \\ b \end{pmatrix}.$$
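A minimal numerical sketch of this KKT system (made-up data; $P$ is made strictly positive definite by construction):

```python
import numpy as np

np.random.seed(1)
n, p = 4, 2
M = np.random.randn(n, n)
P = M.T @ M + n * np.eye(n)        # strictly positive definite
q = np.random.randn(n)
A = np.random.randn(p, n)
b = np.random.randn(p)

# Block KKT system:  [P  A^T] [x* ]   [-q]
#                    [A   0 ] [nu*] = [ b]
KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([-q, b]))
x_star, nu_star = sol[:n], sol[n:]

print("primal feasibility:", A @ x_star - b)             # ~ 0
print("stationarity:", P @ x_star + q + A.T @ nu_star)   # ~ 0
```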

22. Dual problem and support vectors

23. Primal problem of SVM
The QP for SVM can be solved in the primal form, but the dual form is easier.
SVM primal problem:
$$\min_{w, b, \xi} \ \frac{1}{2} \sum_{i,j=1}^N w_i w_j k(X_i, X_j) + C \sum_{i=1}^N \xi_i$$
subject to
$$Y_i \Bigl( \sum_{j=1}^N k(X_i, X_j) w_j + b \Bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$$

24. Dual problem of SVM
SVM dual problem:
$$\max_\alpha \ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j Y_i Y_j K_{ij}$$
subject to
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^N \alpha_i Y_i = 0,$$
where $K_{ij} = k(X_i, X_j)$. Solve it by a QP solver (a sketch follows). Note that the constraints are simpler than in the primal problem.
Derivation [Exercise]. Hint: compute the Lagrange dual function $g(\alpha, \beta)$ from
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \sum_{i,j=1}^N w_i w_j k(X_i, X_j) + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i \Bigl\{ 1 - Y_i \Bigl( \sum_{j=1}^N w_j k(X_i, X_j) + b \Bigr) - \xi_i \Bigr\} + \sum_{i=1}^N \beta_i (-\xi_i).$$
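A minimal numerical sketch of solving this dual (my construction, not from the slides): toy two-class data, a Gaussian kernel for $k$, and the cvxpy modeling library as the QP solver.

```python
import numpy as np
import cvxpy as cp

# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
N = 40
X = np.vstack([rng.normal(-2, 1, (N // 2, 2)), rng.normal(2, 1, (N // 2, 2))])
Y = np.concatenate([-np.ones(N // 2), np.ones(N // 2)])
C = 1.0

# Gram matrix K_ij = k(X_i, X_j) for a Gaussian kernel.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist)

# Dual QP: max sum(alpha) - 0.5 alpha^T Q alpha,  0 <= alpha <= C,  Y^T alpha = 0.
Q = np.outer(Y, Y) * K + 1e-8 * np.eye(N)   # tiny jitter keeps Q numerically PSD
alpha = cp.Variable(N)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, alpha <= C, Y @ alpha == 0]
cp.Problem(objective, constraints).solve()
print("support vectors:", int((alpha.value > 1e-6).sum()), "of", N)
```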

25. KKT conditions of SVM
KKT conditions:
(1) $1 - Y_i f^*(X_i) - \xi_i^* \le 0$ ($\forall i$),
(2) $-\xi_i^* \le 0$ ($\forall i$),
(3) $\alpha_i^* \ge 0$ ($\forall i$),
(4) $\beta_i^* \ge 0$ ($\forall i$),
(5) $\alpha_i^* (1 - Y_i f^*(X_i) - \xi_i^*) = 0$ ($\forall i$),
(6) $\beta_i^* \xi_i^* = 0$ ($\forall i$),
(7) stationarity:
$w$: $\sum_{j=1}^N K_{ij} w_j^* = \sum_{j=1}^N \alpha_j^* Y_j K_{ij}$ ($\forall i$),
$b$: $\sum_{j=1}^N \alpha_j^* Y_j = 0$,
$\xi$: $C - \alpha_i^* - \beta_i^* = 0$ ($\forall i$).

26. Solution of SVM
SVM solution in dual form:
$$f(x) = \sum_{i=1}^N \alpha_i^* Y_i k(x, X_i) + b^*$$
(use KKT condition (7)). How to solve for $b^*$ is shown later.

27. Support vectors I
Complementary slackness:
$$\alpha_i^* (1 - Y_i f^*(X_i) - \xi_i^*) = 0 \ (\forall i), \qquad (C - \alpha_i^*) \xi_i^* = 0 \ (\forall i).$$
If $\alpha_i^* = 0$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) \ge 1$. [well separated]
Support vectors (points with $\alpha_i^* > 0$):
If $0 < \alpha_i^* < C$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) = 1$.
If $\alpha_i^* = C$, then $Y_i f^*(X_i) \le 1$.

28. Support vectors II
Sparse representation: the optimal classifier is expressed only with the support vectors:
$$f(x) = \sum_{i:\,\text{support vector}} \alpha_i^* Y_i k(x, X_i) + b^*.$$
[Figure: a separating hyperplane with normal $w$; support vectors with $0 < \alpha_i < C$ lie on the margin ($Y_i f(X_i) = 1$); support vectors with $\alpha_i = C$ lie on or inside the margin ($Y_i f(X_i) \le 1$).]

29. How to solve b
The optimal value of $b$ is given by the complementary slackness. For any $i$ with $0 < \alpha_i^* < C$,
$$Y_i \Bigl( \sum_j k(X_i, X_j) Y_j \alpha_j^* + b^* \Bigr) = 1.$$
Use the above relation for any such $i$, or take the average over all such $i$.
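Continuing the dual sketch above (the names alpha, K, X, Y, C from that snippet are assumed), $b^*$ can be recovered by averaging this relation over the margin support vectors:

```python
# b* from points with 0 < alpha_i < C:  Y_i (sum_j K_ij Y_j alpha_j + b) = 1,
# i.e.  b = Y_i - sum_j K_ij Y_j alpha_j  (using 1/Y_i = Y_i for Y_i in {-1,+1}).
a = alpha.value
tol = 1e-6
on_margin = (a > tol) & (a < C - tol)
b_star = float(np.mean(Y[on_margin] - (K @ (a * Y))[on_margin]))

# Sparse prediction: only the support vectors enter the decision function.
sv = a > tol
def f(x_new):
    k_new = np.exp(-0.5 * ((X[sv] - x_new) ** 2).sum(-1))
    return (a[sv] * Y[sv]) @ k_new + b_star
```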

30. Sequential Minimal Optimization (SMO)

31. Computational problem in solving SVM
The dual QP problem of SVM has $N$ variables, where $N$ is the sample size. If $N$ is very large, say $N = 100{,}000$, the optimization is very hard. Several approaches have been proposed for optimizing subsets of the variables sequentially:
- Chunking [Vap82]
- Osuna's method [OFG]
- Sequential minimal optimization (SMO) [Pla99]
- SVMlight

32. Sequential minimal optimization (SMO) I
Solve small QP problems sequentially for a pair of variables $(\alpha_i, \alpha_j)$.
How to choose the pair? Intuition from the KKT conditions is used. After removing $w$, $\xi$, and $\beta$, the KKT conditions of SVM are equivalent to
$$0 \le \alpha_i^* \le C, \qquad \sum_{i=1}^N Y_i \alpha_i^* = 0,$$
$$\alpha_i^* = 0 \ \Rightarrow \ Y_i f^*(X_i) \ge 1, \qquad 0 < \alpha_i^* < C \ \Rightarrow \ Y_i f^*(X_i) = 1, \qquad \alpha_i^* = C \ \Rightarrow \ Y_i f^*(X_i) \le 1. \quad (*)$$
(See the Appendix.) The conditions can be checked for each data point, as in the sketch below. Choose $(i, j)$ such that at least one of them breaks the KKT conditions.
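A sketch of the per-point check of (*); the tolerance tau and the function name are my choices:

```python
def violates_kkt(i, a, Yf, C, tau=1e-3):
    """True if point i breaks the reduced KKT conditions (*).

    a  : current alpha vector
    Yf : Yf[i] = Y_i * f(X_i) under the current alpha and b
    """
    if a[i] < tau:                    # alpha_i ~ 0    =>  need Y_i f(X_i) >= 1
        return Yf[i] < 1.0 - tau
    if a[i] > C - tau:                # alpha_i ~ C    =>  need Y_i f(X_i) <= 1
        return Yf[i] > 1.0 + tau
    return abs(Yf[i] - 1.0) > tau     # 0 < alpha_i < C => need Y_i f(X_i) = 1
```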

33. Sequential minimal optimization (SMO) II
The QP problem for $(\alpha_i, \alpha_j)$ is analytically solvable!
For simplicity, assume $(i, j) = (1, 2)$.
Constraint on $\alpha_1$ and $\alpha_2$:
$$\alpha_1 + s_{12} \alpha_2 = \gamma, \qquad 0 \le \alpha_1, \alpha_2 \le C,$$
where $s_{12} = Y_1 Y_2$ and $\gamma = \pm \sum_{l \ge 3} Y_l \alpha_l$ is constant.
Objective function:
$$\alpha_1 + \alpha_2 - \tfrac{1}{2} \alpha_1^2 K_{11} - \tfrac{1}{2} \alpha_2^2 K_{22} - s_{12} \alpha_1 \alpha_2 K_{12} - Y_1 \alpha_1 \sum_{j \ge 3} Y_j \alpha_j K_{1j} - Y_2 \alpha_2 \sum_{j \ge 3} Y_j \alpha_j K_{2j} + \text{const.}$$
After eliminating $\alpha_1$ through the equality constraint, this is a quadratic optimization of one variable on an interval, which is solved directly.
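A sketch of the resulting analytic pair update, in the form given by Platt [Pla99] with prediction errors $E_i = f(X_i) - Y_i$ (the function and variable names are mine):

```python
def smo_pair_update(a, i, j, Y, K, E, C):
    """One analytic SMO step on the pair (alpha_i, alpha_j)."""
    s = Y[i] * Y[j]
    # Feasible interval [L, H] for alpha_j, from the box constraints and
    # the equality alpha_i + s * alpha_j = const.
    if s < 0:
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature of the 1-D quadratic
    if eta <= 0 or L >= H:
        return a[i], a[j]                      # degenerate pair: skip it
    aj_new = a[j] + Y[j] * (E[i] - E[j]) / eta # unconstrained 1-D optimum
    aj_new = min(H, max(L, aj_new))            # clip to the interval
    ai_new = a[i] + s * (a[j] - aj_new)        # restore the equality constraint
    return ai_new, aj_new
```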

34. Other approaches

35. Other approaches to optimization of SVM
Recent studies (not a complete list):
Solution in primal:
- O. Chapelle [Cha07]
- T. Joachims, SVMperf [Joa06]
- S. Shalev-Shwartz et al. [SSSS07]
Online SVM:
- Tax and Laskov [TL03]
- LaSVM [BEWB05]
Parallel computation:
- Cascade SVM [GCB+05]
- Zanni et al. [ZSZ06]
Others:
- Column generation technique for large-scale problems [DBS02]

36. References I
[BEWB05] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 2005.
[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. boyd/cvxbook/.
[Cha07] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19, 2007.
[DBS02] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3), 2002.

37. References II
[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: The Cascade SVM. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.
[Joa06] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.
[OFG] Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing (IEEE NNSP 1997).

38. References III
[Pla99] John Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher Burges, and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. International Conference on Machine Learning, 2007.
[TL03] D.M.J. Tax and P. Laskov. Online SVM learning: from classification to data description and back. In Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing (NNSP 2003), 2003.
[Vap82] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

39. References IV
[ZSZ06] Luca Zanni, Thomas Serafini, and Gaetano Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7, 2006.

40. Appendix: Proof of KKT condition
Proof. $x^*$ is primal-feasible by the first two conditions. Since $\lambda_i^* \ge 0$, $L(x, \lambda^*, \nu^*)$ is convex in $x$ (and differentiable). The last condition, $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$, implies $x^*$ is a minimizer. It follows that
$$g(\lambda^*, \nu^*) = \inf_{x \in D} L(x, \lambda^*, \nu^*) \quad \text{[by definition]}$$
$$= L(x^*, \lambda^*, \nu^*) \quad [x^*\text{: minimizer}]$$
$$= f(x^*) + \sum_{i=1}^l \lambda_i^* h_i(x^*) + \sum_{j=1}^m \nu_j^* r_j(x^*) = f(x^*) \quad \text{[complementary slackness and } r_j(x^*) = 0\text{]}.$$
Strong duality holds, and $x^*$ and $(\lambda^*, \nu^*)$ must be the optimizers.

41. Appendix: KKT conditions revisited I
$\beta$ and $w$ can be removed by
$$\xi: \ \beta_i^* = C - \alpha_i^* \ (\forall i), \qquad w: \ \sum_{j=1}^N K_{ij} w_j^* = \sum_{j=1}^N \alpha_j^* Y_j K_{ij} \ (\forall i).$$
From KKT (4) and (6), $\alpha_i^* \le C$ and $\xi_i^* (C - \alpha_i^*) = 0$ ($\forall i$).
The KKT conditions are equivalent to
(a) $1 - Y_i f^*(X_i) - \xi_i^* \le 0$ ($\forall i$),
(b) $\xi_i^* \ge 0$ ($\forall i$),
(c) $0 \le \alpha_i^* \le C$ ($\forall i$),
(d) $\alpha_i^* (1 - Y_i f^*(X_i) - \xi_i^*) = 0$ ($\forall i$),
(e) $\xi_i^* (C - \alpha_i^*) = 0$ ($\forall i$),
(f) $\sum_{i=1}^N Y_i \alpha_i^* = 0$,
together with $\beta_i^* = C - \alpha_i^*$ and $\sum_{j=1}^N K_{ij} w_j^* = \sum_{j=1}^N \alpha_j^* Y_j K_{ij}$.

42. Appendix: KKT conditions revisited II
We can further remove $\xi$.
Case $\alpha_i^* = 0$: from (e), $\xi_i^* = 0$; then from (a), $Y_i f^*(X_i) \ge 1$.
Case $0 < \alpha_i^* < C$: from (e), $\xi_i^* = 0$; from (d), $Y_i f^*(X_i) = 1$.
Case $\alpha_i^* = C$: from (d) and (b), $\xi_i^* = 1 - Y_i f^*(X_i) \ge 0$.
Note that in all cases (a) and (b) are satisfied. The KKT conditions are equivalent to
$$0 \le \alpha_i^* \le C \ (\forall i), \qquad \sum_{i=1}^N Y_i \alpha_i^* = 0,$$
$$\alpha_i^* = 0 \ \Rightarrow \ Y_i f^*(X_i) \ge 1 \quad (\xi_i^* = 0),$$
$$0 < \alpha_i^* < C \ \Rightarrow \ Y_i f^*(X_i) = 1 \quad (\xi_i^* = 0),$$
$$\alpha_i^* = C \ \Rightarrow \ Y_i f^*(X_i) \le 1 \quad (\xi_i^* = 1 - Y_i f^*(X_i)).$$
