Outline

- Basic concepts: SVM and kernels
- SVM primal/dual problems

Chih-Jen Lin (National Taiwan Univ.)
Basic concepts: SVM and kernels
Data Classification

- Given training data in different classes (labels known)
- Predict test data (labels unknown)
- Two phases: training and testing
Support Vector Classification

Training vectors: $x_i$, $i = 1, \ldots, l$. These are feature vectors; for example, a patient $= [\text{height}, \text{weight}, \ldots]^T$.

Consider a simple case with two classes. Define an indicator vector $y$:
$$
y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1,} \\ -1 & \text{if } x_i \text{ in class 2.} \end{cases}
$$
Goal: a hyperplane which separates all data.
A separating hyperplane: $w^T x + b = 0$ with
$$
w^T x_i + b \geq 1 \quad \text{if } y_i = 1, \qquad
w^T x_i + b \leq -1 \quad \text{if } y_i = -1.
$$
[Figure: the three parallel lines $w^T x + b = +1, 0, -1$.]

Decision function: $f(x) = \operatorname{sgn}(w^T x + b)$, where $x$ is a test point.

Many possible choices of $w$ and $b$.
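As a minimal NumPy sketch (with a hypothetical $w$ and $b$, not trained values), the decision function is just a sign check:

```python
import numpy as np

def decision_function(w, b, x):
    """Predict the class of a test point x via f(x) = sgn(w^T x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical separating hyperplane: w = [1, 1], b = -1
w = np.array([1.0, 1.0])
b = -1.0
print(decision_function(w, b, np.array([2.0, 2.0])))    # on the class-1 side
print(decision_function(w, b, np.array([-1.0, -1.0])))  # on the class-2 side
```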
Maximal Margin

Distance between $w^T x + b = 1$ and $w^T x + b = -1$: $2/\|w\| = 2/\sqrt{w^T w}$.

Maximizing this margin gives a quadratic programming problem (Boser et al., 1992):
$$
\min_{w, b} \quad \frac{1}{2} w^T w
\quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1, \ i = 1, \ldots, l.
$$
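The margin formula $2/\|w\|$ can be checked numerically: starting from a point on $w^T x + b = 1$ and stepping a distance $2/\|w\|$ along $-w/\|w\|$ lands exactly on $w^T x + b = -1$ (the $w$ below is an assumed example, with $\|w\| = 5$):

```python
import numpy as np

w = np.array([3.0, 4.0])  # hypothetical weight vector, ||w|| = 5
b = 0.0
margin = 2.0 / np.sqrt(w @ w)            # claimed distance between the two hyperplanes

x_plus = w / (w @ w)                     # satisfies w^T x + b = 1
x_minus = x_plus - margin * w / np.linalg.norm(w)  # step by `margin` along -w

print(margin)                            # 2 / ||w|| = 0.4
print(w @ x_minus + b)                   # lands on w^T x + b = -1
```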
Data May Not Be Linearly Separable

[Figure: an example data set that no hyperplane separates.]

Two remedies:
- Allow training errors
- Map data to a higher dimensional (maybe infinite) feature space:
$$
\phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T.
$$
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
$$
\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ i = 1, \ldots, l.
$$
Example: $x \in \mathbb{R}^3$, $\phi(x) \in \mathbb{R}^{10}$:
$$
\phi(x) = [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3]^T.
$$
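The degree-2 mapping above can be written out explicitly; a small sketch that builds the 10-dimensional feature vector for a point in $\mathbb{R}^3$:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x in R^3 (10 features)."""
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

x = np.array([1.0, 2.0, 3.0])
print(phi(x).shape)  # (10,)
```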
Finding the Decision Function

$w$: maybe infinitely many variables. The dual problem has a finite number of variables:
$$
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \ i = 1, \ldots, l, \quad y^T \alpha = 0,
$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$.

At optimum: $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$.

A finite problem: #variables = #training data.
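A minimal sketch of solving this dual on a tiny toy problem, using a general-purpose solver (`scipy.optimize.minimize` with SLSQP) rather than a dedicated SVM solver; the data, $C$, and the linear kernel are all assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny hypothetical 2-class problem, linear kernel K(x, z) = x^T z
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def dual_objective(a):
    # (1/2) a^T Q a - e^T a
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(dual_objective,
               x0=np.zeros(len(y)),
               method='SLSQP',
               bounds=[(0.0, C)] * len(y),              # 0 <= alpha_i <= C
               constraints=[{'type': 'eq', 'fun': lambda a: y @ a}])  # y^T alpha = 0
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
print(np.round(w, 3))
```

With only four variables this converges easily; real SVM software uses specialized decomposition methods instead.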
Kernel Tricks

$Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form.

Example: $x_i \in \mathbb{R}^3$, $\phi(x_i) \in \mathbb{R}^{10}$:
$$
\phi(x_i) = [1,\ \sqrt{2}(x_i)_1,\ \sqrt{2}(x_i)_2,\ \sqrt{2}(x_i)_3,\ (x_i)_1^2,\ (x_i)_2^2,\ (x_i)_3^2,\ \sqrt{2}(x_i)_1(x_i)_2,\ \sqrt{2}(x_i)_1(x_i)_3,\ \sqrt{2}(x_i)_2(x_i)_3]^T.
$$
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.

Kernel: $K(x, y) = \phi(x)^T \phi(y)$. Common kernels:
$$
e^{-\gamma \|x_i - x_j\|^2} \quad \text{(radial basis function)}, \qquad
(x_i^T x_j / a + b)^d \quad \text{(polynomial)}.
$$
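The identity $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$ can be verified numerically on random points; the kernel side needs only a 3-dimensional inner product, never the 10-dimensional map:

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map for x in R^3."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], s * x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(3), rng.standard_normal(3)
lhs = phi(xi) @ phi(xj)          # inner product in R^10
rhs = (1.0 + xi @ xj) ** 2       # closed-form kernel in R^3
print(abs(lhs - rhs))            # ~0 up to rounding
```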
A kernel can be an inner product in an infinite dimensional space. Assume $x \in \mathbb{R}^1$ and $\gamma > 0$:
$$
\begin{aligned}
e^{-\gamma\|x_i - x_j\|^2} &= e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots\right) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 \cdot 1 + \sqrt{\tfrac{2\gamma}{1!}}\, x_i \cdot \sqrt{\tfrac{2\gamma}{1!}}\, x_j + \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x_i^2 \cdot \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x_j^2 + \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x_i^3 \cdot \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x_j^3 + \cdots\right) \\
&= \phi(x_i)^T \phi(x_j),
\end{aligned}
$$
where
$$
\phi(x) = e^{-\gamma x^2}\left[1,\ \sqrt{\tfrac{2\gamma}{1!}}\, x,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x^2,\ \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x^3,\ \ldots\right]^T.
$$
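This expansion can be checked numerically: truncating $\phi(x)$ to a few dozen coordinates already reproduces the RBF kernel value to machine precision (the values of $\gamma$, $x_i$, $x_j$ below are arbitrary assumptions):

```python
import numpy as np

gamma, xi, xj = 0.7, 0.8, -0.3
n_terms = 30
k = np.arange(n_terms)

# terms[k] = (2*gamma)^k / k!, built stably with a cumulative product
ratios = np.concatenate(([1.0], 2.0 * gamma / np.arange(1, n_terms)))
terms = np.cumprod(ratios)

# Truncated feature vectors phi(x) = exp(-gamma x^2) * [sqrt((2g)^k / k!) x^k]
phi_xi = np.exp(-gamma * xi**2) * np.sqrt(terms) * xi**k
phi_xj = np.exp(-gamma * xj**2) * np.sqrt(terms) * xj**k

approx = phi_xi @ phi_xj                    # truncated phi(x_i)^T phi(x_j)
exact = np.exp(-gamma * (xi - xj) ** 2)     # RBF kernel value
print(abs(approx - exact))                  # negligible truncation error
```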
Decision function

At optimum, $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$, so the decision function becomes
$$
w^T \phi(x) + b
= \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b
= \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b.
$$
Only the $\phi(x_i)$ with $\alpha_i > 0$ are used; these $x_i$ are the support vectors.
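A sketch of this kernelized decision function, with hypothetical (untrained) $\alpha$, $y$, $b$, and data; note the point with $\alpha_i = 0$ never enters the sum:

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decide(alpha, y, X, b, x, gamma=0.5):
    """f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b); only alpha_i > 0 contribute."""
    sv = alpha > 0                      # support vector mask
    s = sum(a * yi * rbf(xi, x, gamma)
            for a, yi, xi in zip(alpha[sv], y[sv], X[sv]))
    return 1 if s + b >= 0 else -1

# Hypothetical values for illustration (not from an actual training run)
X = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.8, 0.8, 0.0])       # x_3 is not a support vector
print(decide(alpha, y, X, 0.0, np.array([0.0, 2.0])))
```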
Support Vectors: More Important Data

Only the $\phi(x_i)$ with $\alpha_i > 0$ are used; these are the support vectors.

[Figure: a two-dimensional data set with its decision boundary; the support vectors are the points closest to the boundary.]
SVM primal/dual problems
Deriving the Dual

Consider the problem without $\xi_i$:
$$
\min_{w, b} \quad \frac{1}{2} w^T w
\quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \geq 1, \ i = 1, \ldots, l.
$$
Its dual:
$$
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad 0 \leq \alpha_i, \ i = 1, \ldots, l, \quad y^T \alpha = 0.
$$
Lagrangian Dual
$$
\max_{\alpha \geq 0}\ \left(\min_{w, b} L(w, b, \alpha)\right),
$$
where
$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left(y_i(w^T \phi(x_i) + b) - 1\right).
$$
Strong duality:
$$
\min \text{Primal} = \max_{\alpha \geq 0}\ \left(\min_{w, b} L(w, b, \alpha)\right).
$$
Simplify the dual. When $\alpha$ is fixed,
$$
\min_{w, b} L(w, b, \alpha) =
\begin{cases}
-\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \neq 0, \\[4pt]
\min_{w}\ \frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left[y_i w^T \phi(x_i) - 1\right] & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0.
\end{cases}
$$
If $\sum_{i=1}^{l} \alpha_i y_i \neq 0$, we can choose $b$ to decrease the term
$$
-b \sum_{i=1}^{l} \alpha_i y_i
$$
in $L(w, b, \alpha)$ to $-\infty$.
If $\sum_{i=1}^{l} \alpha_i y_i = 0$, the optimum of the strictly convex function
$$
\frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left[y_i w^T \phi(x_i) - 1\right]
$$
happens when
$$
\frac{\partial}{\partial w} L(w, b, \alpha) = 0.
$$
Thus,
$$
w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).
$$
Note that
$$
w^T w = \left(\sum_{i=1}^{l} \alpha_i y_i \phi(x_i)\right)^T \left(\sum_{j=1}^{l} \alpha_j y_j \phi(x_j)\right)
= \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j).
$$
The dual is
$$
\max_{\alpha \geq 0}
\begin{cases}
\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0, \\[4pt]
-\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \neq 0.
\end{cases}
$$
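The identity $w^T w = \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$ is easy to confirm numerically with a linear kernel ($\phi(x) = x$) on random data; all values below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
l, d = 5, 3
X = rng.standard_normal((l, d))             # rows are the x_i (hypothetical data)
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
# sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j as a quadratic form
quad = (alpha * y) @ (X @ X.T) @ (alpha * y)
print(abs(w @ w - quad))                     # ~0 up to rounding
```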
Recall the Lagrangian dual is $\max_{\alpha \geq 0} \left(\min_{w, b} L(w, b, \alpha)\right)$.

$-\infty$ is definitely not the maximum of the dual, so a dual optimal solution does not happen when $\sum_{i=1}^{l} \alpha_i y_i \neq 0$.

The dual is therefore simplified to
$$
\max_{\alpha \in \mathbb{R}^l} \quad \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)
\quad \text{subject to} \quad y^T \alpha = 0, \quad \alpha_i \geq 0, \ i = 1, \ldots, l.
$$
Our problems may be infinite dimensional, but we can still use Lagrangian duality. See a rigorous discussion in Lin (2001).
References

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.