Kernel Method: Data Analysis with Positive Definite Kernels


Kernel Method: Data Analysis with Positive Definite Kernels
4. Support Vector Machine

Kenji Fukumizu
The Institute of Statistical Mathematics / Graduate University for Advanced Studies / Tokyo Institute of Technology
Nov. 2010, Intensive Course at Tokyo Institute of Technology

Outline
- A quick course on convex optimization
  - Convexity and convex optimization
  - Dual problem for optimization
- Optimization in learning of SVM
  - Dual problem and support vectors
  - Sequential Minimal Optimization (SMO)
  - Other approaches
- Extension of SVM
  - Multiclass classification with SVM
  - Combination of binary classifiers
  - Structured output and others

Optimization of SVM

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j) + C\sum_{i=1}^{N}\xi_i \quad \text{subj. to}\ \begin{cases} Y_i\bigl(\sum_{j=1}^{N} k(X_i, X_j)\,w_j + b\bigr) \ge 1 - \xi_i,\\ \xi_i \ge 0. \end{cases}$$

This is a quadratic program (QP), a special case of convex optimization. The QP for SVM can be solved in the above form, but the dual form is easier.

A quick course on convex optimization: Convexity and convex optimization

Convexity I

For the details on convex optimization, see [BV04].

Convex set: a set $C$ in a vector space is convex if for every $x, y \in C$ and $t \in [0, 1]$,
$$tx + (1 - t)y \in C.$$

Convex function: let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a convex function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y).$$

Concave function: let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a concave function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1 - t)y) \ge t f(x) + (1 - t) f(y).$$
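The defining inequalities can be probed numerically. Below is a minimal sketch (the function names are ours, not from the lecture) that samples points on a chord and reports a violation of the convexity inequality; a violation certifies non-convexity, while passing the check is only evidence of convexity.

```python
import numpy as np

def violates_convexity(f, x, y, num_t=25, tol=1e-9):
    """True if f(t x + (1-t) y) > t f(x) + (1-t) f(y) for some sampled t."""
    for t in np.linspace(0.0, 1.0, num_t):
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return True
    return False

f = lambda v: float(v @ v)                        # convex: squared norm
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(violates_convexity(f, x, y))                # False
print(violates_convexity(lambda v: -f(v), x, y))  # True: -||v||^2 is concave
```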

Convexity II

[Figure: examples of a convex set, a non-convex set, a convex function, and a concave function.]

Convexity III

Fact: if $f : C \to \mathbb{R}$ is a convex function, the sublevel set $\{x \in C \mid f(x) \le \alpha\}$ is a convex set for every $\alpha \in \mathbb{R}$.

If $f_t : C \to \mathbb{R}$ ($t \in T$) are convex, then $f(x) = \sup_{t \in T} f_t(x)$ is also convex.

Convex Optimization I

A general form of convex optimization:
- $D$: convex set in $\mathbb{R}^n$.
- $f(x), h_i(x)$ ($1 \le i \le l$): convex functions $D \to \mathbb{R}$.
- $a_j \in \mathbb{R}^n$, $b_j \in \mathbb{R}$ ($1 \le j \le m$).

$$\min_{x \in D} f(x) \quad \text{subject to}\ \begin{cases} h_i(x) \le 0 & (1 \le i \le l),\\ a_j^T x + b_j = 0 & (1 \le j \le m). \end{cases}$$

$h_i$: inequality constraints; $r_j(x) = a_j^T x + b_j$: linear equality constraints.

Feasible set: $F = \{x \in D \mid h_i(x) \le 0\ (1 \le i \le l),\ r_j(x) = 0\ (1 \le j \le m)\}$. The above optimization problem is called feasible if $F \neq \emptyset$.

Convex Optimization II

Fact 1: the feasible set is a convex set.
Fact 2: the set of minimizers $X_{\mathrm{opt}} = \{x \in F \mid f(x) = \inf\{f(y) \mid y \in F\}\}$ is convex.

In particular, there are no non-global local minima in convex optimization.

Proof. The intersection of convex sets is convex, which proves Fact 1. Let $p^* = \inf_{x \in F} f(x)$. Then $X_{\mathrm{opt}} = \{x \in D \mid f(x) \le p^*\} \cap F$. Both sets on the r.h.s. are convex, which proves Fact 2. □

Examples

Linear program (LP):
$$\min\ c^T x \quad \text{subject to}\ \begin{cases} Ax = b,\\ Gx \preceq h. \end{cases}$$
The objective function and the equality and inequality constraints are all linear. (Here $Gx \preceq h$ denotes $g_j^T x \le h_j$ for all $j$, where $G = (g_1, \ldots, g_m)^T$.)

Quadratic program (QP):
$$\min\ \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to}\ \begin{cases} Ax = b,\\ Gx \preceq h, \end{cases}$$
where $P$ is a positive semidefinite matrix. Objective function: quadratic. Equality and inequality constraints: linear.
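Such a QP can be handed to an off-the-shelf solver. The sketch below assumes the cvxopt package; its solvers.qp routine minimizes $(1/2)x^T P x + q^T x$ subject to $Gx \le h$ and $Ax = b$, matching the form above (the constant $r$ does not affect the minimizer). The specific numbers are only an illustration.

```python
import numpy as np
from cvxopt import matrix, solvers

# min (1/2) x^T P x + q^T x  subj. to  G x <= h,  A x = b
P = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))  # positive semidefinite
q = matrix(np.array([-2.0, -5.0]))
G = matrix(-np.eye(2))               # x >= 0, written as -x <= 0
h = matrix(np.zeros(2))
A = matrix(np.array([[1.0, 1.0]]))   # equality constraint x1 + x2 = 1
b = matrix(np.array([1.0]))

sol = solvers.qp(P, q, G, h, A, b)
print(np.array(sol['x']).ravel())    # optimal x
```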

A quick course on convex optimization: Dual problem for optimization

Lagrange Dual

Consider an optimization problem (which may not be convex):
$$\text{(primal)} \quad \min_{x \in D} f(x) \quad \text{subject to}\ \begin{cases} h_i(x) \le 0 & (1 \le i \le l),\\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Lagrange dual function: $g : \mathbb{R}^l \times \mathbb{R}^m \to [-\infty, \infty)$,
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu), \quad \text{where}\quad L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{l} \lambda_i h_i(x) + \sum_{j=1}^{m} \nu_j r_j(x).$$

$\lambda_i$ and $\nu_j$ are called Lagrange multipliers. $g$ is a concave function (a pointwise infimum of functions affine in $(\lambda, \nu)$).

Dual Problem and Weak Duality I

Dual problem:
$$\text{(dual)} \quad \max\ g(\lambda, \nu) \quad \text{subject to}\ \lambda \ge 0.$$

The dual and primal problems are closely connected.

Theorem 1 (weak duality). Let
$$p^* = \inf\{f(x) \mid h_i(x) \le 0\ (1 \le i \le l),\ r_j(x) = 0\ (1 \le j \le m)\}$$
and
$$d^* = \sup\{g(\lambda, \nu) \mid \lambda \ge 0,\ \nu \in \mathbb{R}^m\}.$$
Then $d^* \le p^*$.

Weak duality does not require convexity of the primal optimization problem.

Dual Problem and Weak Duality II

Proof. Let $\lambda \ge 0$, $\nu \in \mathbb{R}^m$. For any feasible point $x$,
$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{l}\lambda_i h_i(x) + \sum_{j=1}^{m}\nu_j r_j(x) \le f(x)$$
(the second term is non-positive and the third term is zero). Taking the infimum over feasible $x$,
$$\inf_{x:\,\text{feasible}} L(x, \lambda, \nu) \le p^*.$$
Thus
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le \inf_{x:\,\text{feasible}} L(x, \lambda, \nu) \le p^*$$
for any $\lambda \ge 0$, $\nu \in \mathbb{R}^m$. □

Strong Duality

We need some conditions to obtain the strong duality $d^* = p^*$:
- Convexity of the problem: $f$ and $h_i$ are convex, $r_j$ are linear.
- Slater's condition: there is $\tilde{x} \in \mathrm{relint}\,D$ such that $h_i(\tilde{x}) < 0$ ($1 \le i \le l$) and $r_j(\tilde{x}) = a_j^T \tilde{x} + b_j = 0$ ($1 \le j \le m$).

Theorem 2 (strong duality). Suppose the primal problem is convex and Slater's condition holds. Then there are $\lambda^* \ge 0$ and $\nu^* \in \mathbb{R}^m$ such that
$$g(\lambda^*, \nu^*) = d^* = p^*.$$

Proof omitted (see [BV04]). There are also other conditions that guarantee strong duality.

Complementary Slackness I

Consider the (not necessarily convex) optimization problem:
$$\min f(x) \quad \text{subject to}\ \begin{cases} h_i(x) \le 0 & (1 \le i \le l),\\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Assumption: the optima of the primal and dual problems are attained at $x^*$ and $(\lambda^*, \nu^*)$ ($\lambda^* \ge 0$), and they satisfy strong duality:
$$g(\lambda^*, \nu^*) = f(x^*).$$

Complementary Slackness II

Observation:
$$f(x^*) = g(\lambda^*, \nu^*) = \inf_{x \in D} L(x, \lambda^*, \nu^*) \quad [\text{definition}]$$
$$\le L(x^*, \lambda^*, \nu^*) = f(x^*) + \sum_{i=1}^{l}\lambda_i^* h_i(x^*) + \sum_{j=1}^{m}\nu_j^* r_j(x^*)$$
$$\le f(x^*) \quad [\text{2nd term} \le 0 \text{ and 3rd term} = 0].$$

The two inequalities are in fact equalities.

Complementary Slackness III

Consequence 1: $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ (the primal solution is obtained by unconstrained optimization).

Consequence 2: $\lambda_i^* h_i(x^*) = 0$ for all $i$.

The latter is called complementary slackness. Equivalently,
$$\lambda_i^* > 0 \implies h_i(x^*) = 0, \quad \text{or} \quad h_i(x^*) < 0 \implies \lambda_i^* = 0.$$

KKT Condition I

The KKT conditions give useful relations between the primal and dual solutions. Consider the convex optimization problem, assume $D$ is open, and let $f(x)$, $h_i(x)$ be differentiable:
$$\min f(x) \quad \text{subject to}\ \begin{cases} h_i(x) \le 0 & (1 \le i \le l),\\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Let $x^*$ and $(\lambda^*, \nu^*)$ be any optimal points of the primal and dual problems, and assume strong duality: $f(x^*) = g(\lambda^*, \nu^*)$.

From Consequence 1 ($x^* = \arg\min_x L(x, \lambda^*, \nu^*)$),
$$\nabla f(x^*) + \sum_{i=1}^{l} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{m} \nu_j^* \nabla r_j(x^*) = 0.$$

KKT Condition II

The following are therefore necessary conditions.

Karush-Kuhn-Tucker (KKT) conditions:
- $h_i(x^*) \le 0$ ($i = 1, \ldots, l$) [primal constraints]
- $r_j(x^*) = 0$ ($j = 1, \ldots, m$) [primal constraints]
- $\lambda_i^* \ge 0$ ($i = 1, \ldots, l$) [dual constraints]
- $\lambda_i^* h_i(x^*) = 0$ ($i = 1, \ldots, l$) [complementary slackness]
- $\nabla f(x^*) + \sum_{i=1}^{l} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{m} \nu_j^* \nabla r_j(x^*) = 0$ [stationarity]

Theorem 3 (KKT condition). For a convex optimization problem with differentiable functions, $x^*$ and $(\lambda^*, \nu^*)$ are the primal-dual solutions with strong duality if and only if they satisfy the KKT conditions.

Example

Quadratic minimization under equality constraints:
$$\min\ \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to}\ Ax = b.$$

KKT conditions:
- $Ax^* = b$ [primal constraint]
- $\nabla_x L(x^*, \nu^*) = 0$, i.e., $P x^* + q + A^T \nu^* = 0$.

The solution is given by
$$\begin{pmatrix} P & A^T \\ A & O \end{pmatrix} \begin{pmatrix} x^* \\ \nu^* \end{pmatrix} = \begin{pmatrix} -q \\ b \end{pmatrix}.$$
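In this equality-constrained case the whole problem reduces to one linear solve of the KKT system. A minimal numpy sketch, assuming the block matrix is nonsingular (the function name is ours):

```python
import numpy as np

def eq_qp(P, q, A, b):
    """Solve min (1/2) x^T P x + q^T x subj. to Ax = b
    via the KKT system [[P, A^T], [A, 0]] [x; nu] = [-q; b]."""
    n, m = P.shape[0], A.shape[0]
    kkt = np.block([[P, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-q, b])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n], sol[n:]          # primal x*, dual nu*

# Example: min (1/2)||x||^2 + x1 + x2  subj. to  x1 - x2 = 0
x, nu = eq_qp(np.eye(2), np.array([1.0, 1.0]),
              np.array([[1.0, -1.0]]), np.array([0.0]))
print(x, nu)                         # x* = [-1, -1], nu* = [0]
```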

Optimization in learning of SVM: Dual problem and support vectors

Primal Problem of SVM

SVM primal problem:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j) + C\sum_{i=1}^{N}\xi_i \quad \text{subj. to}\ \begin{cases} Y_i\bigl(\sum_{j=1}^{N} k(X_i, X_j)\,w_j + b\bigr) \ge 1 - \xi_i,\\ \xi_i \ge 0. \end{cases}$$

The QP for SVM can be solved in the primal form, but the dual form is easier.

Dual Problem of SVM

SVM dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j Y_i Y_j K_{ij} \quad \text{subj. to}\ \begin{cases} 0 \le \alpha_i \le C,\\ \sum_{i=1}^{N} \alpha_i Y_i = 0, \end{cases}$$
where $K_{ij} = k(X_i, X_j)$. Solve it with a QP solver. Note: the constraints are simpler than in the primal problem.

Derivation [exercise]. Hint: compute the Lagrange dual function $g(\alpha, \beta)$ from
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\sum_{i,j=1}^{N} w_i w_j k(X_i, X_j) + C\sum_{i=1}^{N}\xi_i + \sum_{i=1}^{N}\alpha_i\Bigl\{1 - Y_i\Bigl(\sum_{j=1}^{N} w_j k(X_i, X_j) + b\Bigr) - \xi_i\Bigr\} + \sum_{i=1}^{N}\beta_i(-\xi_i).$$
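Written as a minimization, the dual is exactly the QP form of the earlier slide, with $P = (YY^T) \circ K$, $q = -\mathbf{1}$, box constraints $0 \le \alpha \le C$, and one equality $Y^T\alpha = 0$, so a generic solver applies. A sketch assuming cvxopt (the function name is ours):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(K, y, C):
    """Solve max sum(a) - (1/2) a^T (yy^T * K) a
    subj. to 0 <= a_i <= C, sum_i y_i a_i = 0, as a QP minimization."""
    N = len(y)
    P = matrix(np.outer(y, y) * K)                   # (yy^T) elementwise K
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))   # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))       # y^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()                # alpha*
```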

KKT Conditions of SVM

KKT conditions:
(1) $1 - Y_i f^*(X_i) - \xi_i^* \le 0$ ($\forall i$) [primal constraints]
(2) $-\xi_i^* \le 0$ ($\forall i$) [primal constraints]
(3) $\alpha_i^* \ge 0$ ($\forall i$) [dual constraints]
(4) $\beta_i^* \ge 0$ ($\forall i$) [dual constraints]
(5) $\alpha_i^*\,(1 - Y_i f^*(X_i) - \xi_i^*) = 0$ ($\forall i$) [complementary slackness]
(6) $\beta_i^* \xi_i^* = 0$ ($\forall i$) [complementary slackness]
(7) stationarity:
  $w$: $\sum_{j=1}^{n} K_{ij} w_j^* = \sum_{j=1}^{n} \alpha_j^* Y_j K_{ij}$ ($\forall i$),
  $b$: $\sum_{j=1}^{n} \alpha_j^* Y_j = 0$,
  $\xi$: $C - \alpha_i^* - \beta_i^* = 0$ ($\forall i$).

Solution of SVM

SVM solution in dual form:
$$f(x) = \sum_{i=1}^{N} \alpha_i^* Y_i k(x, X_i) + b^*$$
(use KKT condition (7)). How to determine $b^*$ is shown later.

Support Vectors I

Complementary slackness:
$$\alpha_i^*\,(1 - Y_i f^*(X_i) - \xi_i^*) = 0 \quad (\forall i), \qquad (C - \alpha_i^*)\,\xi_i^* = 0 \quad (\forall i).$$

- If $\alpha_i^* = 0$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) \ge 1$. [well separated]
- Support vectors:
  - If $0 < \alpha_i^* < C$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) = 1$. [on the margin border]
  - If $\alpha_i^* = C$, then $Y_i f^*(X_i) \le 1$. [within the margin]

Support Vectors II

Sparse representation: the optimum classifier is expressed using only the support vectors:
$$f(x) = \sum_{i:\ \text{support vector}} \alpha_i^* Y_i k(x, X_i) + b^*.$$

[Figure: separating hyperplane with support vectors; those with $0 < \alpha_i < C$ lie on the margin border ($Y_i f(X_i) = 1$), those with $\alpha_i = C$ lie within the margin ($Y_i f(X_i) \le 1$).]

How to Solve b

The optimum value of $b$ is given by complementary slackness. For any $i$ with $0 < \alpha_i^* < C$,
$$Y_i\Bigl(\sum_j k(X_i, X_j)\,Y_j \alpha_j^* + b^*\Bigr) = 1.$$
Use this relation for any such $i$, or take the average over all such $i$ (numerically more stable).
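Combining the last few slides, a small sketch (the names and tolerance are ours) that extracts the support vectors from a dual solution, averages $b^*$ over the margin support vectors, and returns the decision function:

```python
import numpy as np

def svm_from_dual(alpha, y, K, C, tol=1e-6):
    """Recover b* and the decision function from the dual solution.
    b* is averaged over i with 0 < alpha_i < C (where Y_i f(X_i) = 1);
    tol guards against round-off in the solver output."""
    sv = alpha > tol                      # support vectors
    margin = sv & (alpha < C - tol)       # on the margin border
    # For margin SVs: y_i (sum_j K_ij y_j alpha_j + b) = 1, hence
    # b = y_i - sum_j K_ij y_j alpha_j (since y_i = +-1).
    b = np.mean(y[margin] - K[margin] @ (y * alpha))
    def f(k_x):
        """k_x: kernel values (k(x, X_1), ..., k(x, X_N)) of a new point x."""
        return k_x @ (y * alpha) + b
    return b, f, sv
```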

Optimization in learning of SVM: Sequential Minimal Optimization (SMO)

Computational Problem in Solving SVM

The dual QP problem of SVM has $N$ variables, where $N$ is the sample size. If $N$ is very large, the optimization becomes very hard. Several approaches have been proposed that optimize subsets of the variables sequentially:
- Chunking [Vap82]
- Osuna's method [OFG]
- Sequential minimal optimization (SMO) [Pla99]
- SVMlight

Sequential Minimal Optimization (SMO) I

Solve small QP problems sequentially for a pair of variables $(\alpha_i, \alpha_j)$.

How to choose the pair? Intuition from the KKT conditions is used. After removing $w$, $\xi$, and $\beta$, the KKT conditions of SVM are equivalent to (see Appendix)
$$\sum_{i=1}^{N} Y_i \alpha_i^* = 0 \quad \text{and} \quad (*)\ \begin{cases} \alpha_i^* = 0 & \Rightarrow\ Y_i f^*(X_i) \ge 1,\\ 0 < \alpha_i^* < C & \Rightarrow\ Y_i f^*(X_i) = 1,\\ \alpha_i^* = C & \Rightarrow\ Y_i f^*(X_i) \le 1. \end{cases}$$

The conditions (*) can be checked for each data point. Choose a pair $(i, j)$ such that at least one of them violates (*).

Sequential Minimal Optimization (SMO) II

The QP problem for $(\alpha_i, \alpha_j)$ is analytically solvable! For simplicity, assume $(i, j) = (1, 2)$.

Constraints on $\alpha_1$ and $\alpha_2$:
$$\alpha_1 + s_{12}\alpha_2 = \gamma, \qquad 0 \le \alpha_1, \alpha_2 \le C,$$
where $s_{12} = Y_1 Y_2$ and $\gamma = -Y_1 \sum_{l \ge 3} Y_l \alpha_l$ is constant.

Objective function:
$$\alpha_1 + \alpha_2 - \tfrac{1}{2}\alpha_1^2 K_{11} - \tfrac{1}{2}\alpha_2^2 K_{22} - s_{12}\alpha_1\alpha_2 K_{12} - Y_1\alpha_1 \sum_{j \ge 3} Y_j \alpha_j K_{1j} - Y_2\alpha_2 \sum_{j \ge 3} Y_j \alpha_j K_{2j} + \text{const.}$$

Eliminating $\alpha_1$ via the equality constraint, this is a quadratic optimization of one variable on an interval, which is solved directly.
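The analytic pair update can be written in a few lines. Below is a minimal sketch in the notation above, for a general pair $(i, j)$; it follows the standard clipped update (the errors $E$, the interval bounds, and the skipped degenerate case $\eta \le 0$ are standard SMO details not spelled out on the slide), and it omits the bias update and pair-selection heuristics of a full implementation.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, b):
    """One analytic SMO step for the pair (alpha_i, alpha_j); assumes eta > 0."""
    f = (alpha * y) @ K + b                 # current f(X_l) for all l
    E_i, E_j = f[i] - y[i], f[j] - y[j]     # prediction errors
    s = y[i] * y[j]
    if s < 0:                               # interval from the equality constraint
        lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2 * K[i, j]   # curvature of the 1-D quadratic
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, lo, hi)
    a_i = alpha[i] + s * (alpha[j] - a_j)   # keep alpha_i + s * alpha_j constant
    alpha[i], alpha[j] = a_i, a_j
    return alpha
```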

Optimization in learning of SVM: Other approaches

Other Approaches to Optimization of SVM

Recent studies (not a complete list):
- Solution in the primal: O. Chapelle [Cha07]; T. Joachims, SVMperf [Joa06]; S. Shalev-Shwartz et al. [SSSS07]; etc.
- Online SVM: Tax and Laskov [TL03]; LaSVM [BEWB05]
- Parallel computation: Cascade SVM [GCB+05]; Zanni et al. [ZSZ06]
- Geometric approach: Mavroforakis and Theodoridis [MT06]

Extension of SVM: Multiclass classification with SVM

Overview of Multiclass Classification I

Multiclass classification:
- $(X_1, Y_1), \ldots, (X_N, Y_N)$: data
- $X_i$: explanatory variable
- $Y_i \in \{C_1, \ldots, C_L\}$: labels for $L$ classes (e.g., digit classification, $L = 10$)
- Make a classifier $h : \mathcal{X} \to \{1, 2, \ldots, L\}$.

The original SVM is applicable only to binary classification problems. There are two main approaches to extending SVM to multiclass classification:
- direct construction of a large-margin multiclass classifier;
- combination of binary classifiers.

Overview of Multiclass Classification II

Various methods (incomplete list):
- Direct approach:
  - Multiclass SVM ([CS01], [WW98], [BB99], [LLW], etc.)
  - Kernel logistic regression ([ZH02], K. Tanabe, [KDSP05]) and others
- Combination approach:
  - How to divide the problem: one-vs-rest (one-vs-all), one-vs-one, error correcting output code (ECOC) [DB95]
  - How to combine the binary classifiers: Hamming decoding, Bradley-Terry model ([HT98], [HWL06])

Multiclass SVM I

Multiclass SVM (Crammer & Singer 2001):
- The large-margin criterion is generalized to multiclass cases.
- Efficient optimization. Implemented in SVMlight.

Linear classifier for $L$-class classification:
- Data: $(X_1, Y_1), \ldots, (X_N, Y_N)$, $X_i \in \mathbb{R}^m$, $Y_i \in \{1, \ldots, L\}$.
- Classifier: $h(x) = \arg\max_{l=1,\ldots,L} w_l^T x$.

$L$ linear classifiers are used (the bias term $b_l$ is omitted for simplicity). $w_l^T x$ ($l = 1, \ldots, L$) is the similarity score for class $l$; the class with the largest score is the answer of the classifier.

Multiclass SVM II

Margin for the multiclass problem:
$$\mathrm{Margin}_i = w_{Y_i}^T X_i - \max_{l \neq Y_i} w_l^T X_i.$$
$W = (w_1, \ldots, w_L)$ correctly classifies the data $(X_i, Y_i)$ if and only if $\mathrm{Margin}_i > 0$. The scale of the margin must be fixed.

Primal problem of multiclass SVM:
$$\min_{W, \xi}\ \frac{\beta}{2}\|W\|^2 + \sum_{i=1}^{N}\xi_i \quad \text{subj. to}\ w_{Y_i}^T X_i + \delta_{l Y_i} - w_l^T X_i \ge 1 - \xi_i \quad (\forall l, \forall i).$$

Note: $\xi_i$ represents the violation of separability. The number of dual variables is $NL$, so the computational cost must be reduced by some method.
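The classifier and the margin above are immediate to compute. A small sketch (function names are ours), with $W$ stored as an $L \times m$ matrix whose rows are the $w_l$:

```python
import numpy as np

def multiclass_predict(W, X):
    """h(x) = argmax_l w_l^T x for each row x of X."""
    return np.argmax(X @ W.T, axis=1)       # scores: (N, L)

def multiclass_margins(W, X, y):
    """Margin_i = w_{y_i}^T x_i - max_{l != y_i} w_l^T x_i."""
    scores = X @ W.T
    idx = np.arange(len(y))
    true_scores = scores[idx, y]            # fancy indexing returns a copy
    scores[idx, y] = -np.inf                # exclude the true class from the max
    return true_scores - scores.max(axis=1)
```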

Multiclass SVM III

[Figure: meaning of the margin; bar plots of the scores $w_l^T X_i$ over the classes, with $\xi_i$ measuring the shortfall of the true class's score.]

Extension of SVM: Combination of binary classifiers

Combination of Binary Classifiers

Base classifiers: make use of strong binary classifiers, e.g., SVM, AdaBoost, etc.

Decomposition of a multiclass classification into binary classifications:
- 1-vs-rest: class $i$ vs the other classes ($L$ problems)
- 1-vs-1: class $i$ vs class $j$ for all pairs $(i, j)$ ($L(L-1)/2$ problems)

A more general approach is the error correcting output code (ECOC). ECOC assigns a codeword to each class.

[Table: a $\pm 1$ code matrix assigning each class $C_1, \ldots, C_4$ a codeword over binary classifiers $f_1, \ldots, f_6$; the entries were lost in transcription.]

The 1-vs-rest and 1-vs-1 decompositions correspond to the following code matrices (shown for $L = 4$ classes; in the 1-vs-1 code, 0 means the class is not used by that classifier):

1-vs-rest:
class | f1 f2 f3 f4
C1    | +1 -1 -1 -1
C2    | -1 +1 -1 -1
C3    | -1 -1 +1 -1
C4    | -1 -1 -1 +1

1-vs-1 (f1, ..., f6 for the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)):
class | f1 f2 f3 f4 f5 f6
C1    | +1 +1 +1  0  0  0
C2    | -1  0  0 +1 +1  0
C3    |  0 -1  0 -1  0 +1
C4    |  0  0 -1  0 -1 -1

Combining Base Classifiers

Hamming decoding for ECOC: let $W_{la}$ be the ECOC code for class $l$ and classifier $f_a$ ($1 \le l \le L$, $1 \le a \le M$). Then
$$h(x) = \arg\min_l \|W_l - f(x)\|_{\mathrm{Hamming}}, \quad \text{where}\ f(x) = (f_1(x), \ldots, f_M(x)) \in \{\pm 1\}^M.$$
This is equivalent to
$$h(x) = \arg\max_l \sum_{a=1}^{M} W_{la} f_a(x).$$
In the case of one-vs-one, Hamming decoding coincides with majority vote.

Bradley-Terry model: a probabilistic model for paired comparison. It can be applied when the output of $f_i(x)$ is continuous.
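Hamming decoding is one line in the arg-max form. A sketch (names are ours), using the 1-vs-rest code matrix from the previous slide as an example:

```python
import numpy as np

def hamming_decode(W, f_x):
    """ECOC decoding: the class whose codeword W[l] best matches f(x),
    via the equivalent rule argmax_l sum_a W[l, a] f_a(x)."""
    return int(np.argmax(W @ f_x))

W = -np.ones((4, 4)) + 2 * np.eye(4)      # 1-vs-rest code: +1 diagonal, -1 off
f_x = np.array([-1.0, -1.0, 1.0, -1.0])   # binary outputs for some input x
print(hamming_decode(W, f_x))             # -> 2, i.e. class C3
```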

Extension of SVM: Structured output and others

Structured Output

The output of prediction may be a structured object, such as label sequences (strings), trees, or graphs.

Large Margin Approach to Structured Output I

References:
- Application to natural language processing [Col02].
- Max-Margin Markov Network (M³N) [TGK04].
- Hidden Markov support vector machine [ATH03].

Approach:
- $(X_1, Y_1), \ldots, (X_N, Y_N)$: data; $X_i$: input variable, $Y_i \in \mathcal{Y}$: structured object.
- Feature vector $F(x, y) = (f_1(x, y), \ldots, f_M(x, y))$.
- Make a classifier $h : \mathcal{X} \to \mathcal{Y}$,
$$h(x) = \arg\max_{y \in \mathcal{Y}} w^T F(x, y).$$

Large Margin Approach to Structured Output II

Formulate the problem as a multiclass classification, where each $y \in \mathcal{Y}$ is regarded as a class. Multiclass SVM gives
$$\min_{w, \xi}\ \frac{\beta}{2}\|w\|^2 + \sum_{i=1}^{N}\xi_i \quad \text{subj. to}\ w^T F(X_i, Y_i) + \delta_{y Y_i} - w^T F(X_i, y) \ge 1 - \xi_i \quad (\forall i, \forall y \in \mathcal{Y}).$$

Problem: the number of constraints (= the number of dual variables) grows with $|\mathcal{Y}|$, which is prohibitive in many cases. E.g., for label sequences, $|\mathcal{Y}| = (\text{alphabet size})^{\text{sequence length}}$. The computational cost must be reduced by some method (e.g., [TGK04, ATH03]).

Other Topics

- Support vector regression [MM00].
- ν-SVM: another formulation of the soft margin [SSWB00]. ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
- One-class SVM (similar to estimating a level set of a density function).
- Large margin approach to ranking.

References I

[ATH03] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning, 2003.

[BB99] Erin J. Bredensteiner and Kristin P. Bennett. Multicategory classification by support vector machines. Computational Optimizations and Applications, 12, 1999.

[BEWB05] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 2005.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

References II

[Cha07] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19, 2007.

[Col02] Michael Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.

[CS01] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 2001.

[DB95] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 1995.

References III

[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: the Cascade SVM. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

[HT98] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1), 1998.

[HWL06] Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. Journal of Machine Learning Research, 7:85-115, 2006.

References IV

[Joa06] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[KDSP05] S. S. Keerthi, K. B. Duan, S. K. Shevade, and A. N. Poo. A fast dual algorithm for kernel logistic regression. Machine Learning, 61(1-3), 2005.

[LLW] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99.

[MM00] O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22, 2000.

References V

[MT06] M. E. Mavroforakis and S. Theodoridis. A geometric approach to support vector machine (SVM) classification. IEEE Trans. Neural Networks, 17(3), 2006.

[OFG] Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing (IEEE NNSP 1997), 1997.

[Pla99] John Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher Burges, and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

References VI

[Shi08] Yuichi Shiraishi. Game-theoretical and statistical study on combination of binary classifiers for multi-class classification. Ph.D. thesis, Department of Statistical Science, The Graduate University for Advanced Studies, 2008.

[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007.

[SSWB00] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12, 2000.

[TGK04] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

References VII

[TL03] D. M. J. Tax and P. Laskov. Online SVM learning: from classification to data description and back. In Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing (NNSP 2003), 2003.

[Vap82] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

[WW98] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.

[ZH02] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Advances in Neural Information Processing Systems 14, 2002.

References VIII

[ZSZ06] Luca Zanni, Thomas Serafini, and Gaetano Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7, 2006.

Appendix: Proof of the KKT Condition

Proof. $x^*$ is primal-feasible by the first two conditions. From $\lambda_i^* \ge 0$, $L(x, \lambda^*, \nu^*)$ is convex in $x$ (and differentiable). The last condition, $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$, implies that $x^*$ is a minimizer. It follows that
$$g(\lambda^*, \nu^*) = \inf_{x \in D} L(x, \lambda^*, \nu^*) \quad [\text{by definition}]$$
$$= L(x^*, \lambda^*, \nu^*) \quad [x^*:\ \text{minimizer}]$$
$$= f(x^*) + \sum_{i=1}^{l}\lambda_i^* h_i(x^*) + \sum_{j=1}^{m}\nu_j^* r_j(x^*) = f(x^*) \quad [\text{complementary slackness and } r_j(x^*) = 0].$$
Strong duality holds, and $x^*$ and $(\lambda^*, \nu^*)$ must be the optimizers. □

Appendix: KKT Conditions Revisited I

$\beta$ and $w$ can be removed by
$$\xi:\ \beta_i^* = C - \alpha_i^* \quad (\forall i), \qquad w:\ \sum_{j=1}^{n} K_{ij} w_j^* = \sum_{j=1}^{n} \alpha_j^* Y_j K_{ij} \quad (\forall i).$$
From KKT (4) and (6), $\alpha_i^* \le C$ and $\xi_i^*(C - \alpha_i^*) = 0$ ($\forall i$).

The KKT conditions are equivalent to
(a) $1 - Y_i f^*(X_i) - \xi_i^* \le 0$ ($\forall i$),
(b) $\xi_i^* \ge 0$ ($\forall i$),
(c) $0 \le \alpha_i^* \le C$ ($\forall i$),
(d) $\alpha_i^*(1 - Y_i f^*(X_i) - \xi_i^*) = 0$ ($\forall i$),
(e) $\xi_i^*(C - \alpha_i^*) = 0$ ($\forall i$),
(f) $\sum_{i=1}^{N} Y_i \alpha_i^* = 0$,
together with $\beta_i^* = C - \alpha_i^*$ and $\sum_{j=1}^{n} K_{ij} w_j^* = \sum_{j=1}^{n} \alpha_j^* Y_j K_{ij}$.

Appendix: KKT Conditions Revisited II

We can further remove $\xi$.
- Case $\alpha_i^* = 0$: from (e), $\xi_i^* = 0$; then from (a), $Y_i f^*(X_i) \ge 1$.
- Case $0 < \alpha_i^* < C$: from (e), $\xi_i^* = 0$; from (d), $Y_i f^*(X_i) = 1$.
- Case $\alpha_i^* = C$: from (d) and (b), $\xi_i^* = 1 - Y_i f^*(X_i) \ge 0$.
Note that in all cases (a) and (b) are satisfied.

The KKT conditions are equivalent to $\sum_{i=1}^{N} Y_i \alpha_i^* = 0$ and
$$\begin{cases} \alpha_i^* = 0 & \Rightarrow\ Y_i f^*(X_i) \ge 1 \quad (\xi_i^* = 0),\\ 0 < \alpha_i^* < C & \Rightarrow\ Y_i f^*(X_i) = 1 \quad (\xi_i^* = 0),\\ \alpha_i^* = C & \Rightarrow\ Y_i f^*(X_i) \le 1 \quad (\xi_i^* = 1 - Y_i f^*(X_i)). \end{cases}$$
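These three cases give a direct optimality check for a computed α. A diagnostic sketch (the names and tolerance are ours):

```python
import numpy as np

def check_kkt(alpha, y, K, b, C, tol=1e-6):
    """Check the reduced KKT conditions above for a candidate dual solution."""
    f = (alpha * y) @ K + b            # f*(X_i) on the training points
    yf = y * f
    ok = abs(float(alpha @ y)) < tol   # sum_i Y_i alpha_i = 0
    for a, m in zip(alpha, yf):
        if a < tol:                    # alpha_i = 0    =>  Y_i f(X_i) >= 1
            ok &= m >= 1 - tol
        elif a > C - tol:              # alpha_i = C    =>  Y_i f(X_i) <= 1
            ok &= m <= 1 + tol
        else:                          # 0 < alpha_i < C => Y_i f(X_i) = 1
            ok &= abs(m - 1) < tol
    return bool(ok)
```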
