Support Vector Machines and Kernel Methods

2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html

Linear Classification For linear separable cases, we have multiple decision boundaries

Linear Classification For linear separable cases, we have multiple decision boundaries Ruling out some separators by considering data noise

Linear Classification For linear separable cases, we have multiple decision boundaries The intuitive optimal decision boundary: the largest margin

Review: Logistic Regression Logistic regression is a binary classification model p μ (y =1jx) =¾(μ > x)= p μ (y =0jx) = e μ> x 1+e μ> x 1 1+e μ> x Cross entropy loss function Gradient L(y;x; p μ )= ylog ¾(μ > x) (1 y)log(1 ¾(μ > x)) ¾(x) @L(y; x; p μ ) 1 = y @μ ¾(μ > x) ¾(z)(1 ¾(z))x (1 y) 1 1 ¾(μ > ¾(z)(1 ¾(z))x x) =(¾(μ > x) y)x @¾(z) μ Ã μ + (y ¾(μ > x))x = ¾(z)(1 ¾(z)) @z x

Label Decision Logistic regression provides the probability p μ (y =1jx) =¾(μ > x)= p μ (y =0jx) = e μ> x 1+e μ> x The final label of an instance is decided by setting a threshold h ^y = ( 1; p μ (y =1jx) >h 0; otherwise 1 1+e μ> x

Logistic Regression Scores x 2 s(x) =μ 0 + μ 1 x 1 + μ 2 x 2 p μ (y =1jx) = 1 1+e s(x) s(x) =μ 0 + μ 1 x (A) 1 + μ 2 x (A) 2 Decision boundary s(x) =μ 0 + μ 1 x (B) 1 + μ 2 x (B) 2 s(x) =μ 0 + μ 1 x (C) 1 + μ 2 x (C) 2 s(x) =0 The higher score, the larger distance to the decision boundary, the higher confidence x 1 Example from Andrew Ng

Linear Classification The intuitive optimal decision boundary: the highest confidence

Notations for SVMs Feature vector Class label Parameters Intercept b Feature weight vector Label prediction x y 2f 1; 1g w h w;b (x) =g(w > x + b) ( +1 z 0 g(z) = 1 otherwise

Logistic Regression Scores x 2 s(x) =b + w 1 x 1 + w 2 x 2 p μ (y =1jx) = 1 1+e s(x) s(x) =b + w 1 x (A) 1 + w 2 x (A) 2 Decision boundary s(x) =b + w 1 x (B) 1 + w 2 x (B) 2 s(x) =b + w 1 x (C) 1 + w 2 x (C) 2 s(x) =0 The higher score, the larger distance to the separating hyperplane, the higher confidence x 1 Example from Andrew Ng

Margins x 2 Functional margin ^ (i) = y (i) (w > x (i) + b) (B) w Note that the separating hyperplane won t change with the magnitude of (w, b) g(w > x + b) =g(2w > x +2b) Decision boundary Geometric margin (i) = y (i) (w > x (i) + b) where kwk 2 =1 x 1

Margins x 2 Decision boundary μ w > x (i) (i) y (i) w kwk + b =0 (B) w ) (i) = y (i) w> x (i) + b kwk μ ³ w >x = y (i) (i) + kwk b kwk Given a training set Decision boundary x 1 S = f(x i ;y i )g :::m the smallest geometric margin = min :::m (i)

Objective of an SVM Find a separable hyperplane that maximizes the minimum geometric margin max ;w;b s.t. y (i) (w > x (i) + b) ; i =1;:::;m kwk =1 (non-convex constraint) Equivalent to normalized functional margin max ^ ;w;b ^ kwk (non-convex objective) s.t. y (i) (w > x (i) + b) ^ ; i =1;:::;m

Objective of an SVM Functional margin scales w.r.t. (w,b) without changing the decision boundary. Let s fix the functional margin at 1. Objective is written as max w;b Equivalent with 1 kwk ^ =1 s.t. y (i) (w > x (i) + b) 1; ;:::;m min w;b 1 2 kwk2 s.t. y (i) (w > x (i) + b) 1; ;:::;m This optimization problem can be efficiently solved by quadratic programming

A Digression of Lagrange Duality in Convex Optimization Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

Lagrangian for Convex Optimization A convex optimization problem min w f(w) s.t. h i (w) =0; i =1;:::;l The Lagrangian of this problem is defined as Solving L(w; ) =f(w)+ lx ih i (w) @L(w; ) @L(w; ) =0 =0 @w @ Lagrangian multipliers yields the solution of the original optimization problem.

Lagrangian for Convex Optimization w 2 f(w) =5 f(w) =4 f(w) =3 h(w) =0 L(w; ) =f(w)+ h(w) @L(w; ) @w = @f(w) @w w 1 + @h(w) @w =0 i.e., two gradients on the same direction

With Inequality Constraints A convex optimization problem min w f(w) s.t. g i (w) 0; i =1;:::;k h i (w) =0; i =1;:::;l The Lagrangian of this problem is defined as L(w; ; ) =f(w)+ kx i g i (w)+ lx ih i (w) Lagrangian multipliers

Primal Problem A convex optimzation min w f(w) s.t. g i (w) 0; i =1;:::;k h i (w) =0; i =1;:::;l The primal problem The Lagrangian kx L(w; ; ) =f(w)+ i g i (w)+ lx ih i (w) μ P P(w) = max ; : i 0 L(w; ; ) If a given w violates any constraints, i.e., Then μ P (w) =+1 g i (w) > 0 or h i (w) 6= 0

Primal Problem A convex optimzation min w f(w) s.t. g i (w) 0; i =1;:::;k h i (w) =0; i =1;:::;l The primal problem The Lagrangian kx L(w; ; ) =f(w)+ i g i (w)+ lx ih i (w) μ P P(w) = max ; : i 0 L(w; ; ) Conversely, if all constraints are satisfied for Then μ P (w) =f(w) μ P (w) = ( f(w) w if w satis es primal constraints +1 otherwise

Primal Problem μ P (w) = ( f(w) if w satis es primal constraints +1 otherwise The minimization problem min μ P(w) =min w w max ; : i 0 L(w; ; ) is the same as the original problem min f(w) w s.t. g i (w) 0; i =1;:::;k h i (w) =0; i =1;:::;l Define the value of the primal problem p =min μ P(w) w

Dual Problem A slightly different problem Define the dual optimization problem Min & Max exchanged compared to the primal problem min μ P(w) =min w w max ; : i 0 Define the value of the dual problem d d = max min ; : i 0 w L(w; ; ) L(w; ; )

Primal Problem vs. Dual Problem d = max min ; : i 0 w L(w; ; ) min w max ; : i 0 L(w; ; ) =p Proof ) max min L(w; ; ) L(w; ; ); 8w; 0; w min ; : 0 w ) max min ; : 0 w L(w; ; ) max ; : 0 L(w; ; ) min w L(w; ; ); 8w max ; : 0 L(w; ; ) But under certain condition d = p

Karush-Kuhn-Tucker (KKT) Conditions If f and g i s are convex and h i s are affine, and suppose g i s are all strictly feasible then there must exist w *, α *, β * w * is the solution of the primal problem α *, β * are the solutions of the dual problem and the values of the two problems are equal And w *, α *, β * satisfy the KKT conditions KKT dual complementarity condition Moreover, if some w *, α *, β * satisfy the KKT conditions, then it is also a solution to the primal and dual problems. More details please refer to Boyd Convex optimization 2004.

Now Back to SVM Problem

Objective of an SVM SVM objective: finding the optimal margin classifier min w;b 1 2 kwk2 s.t. y (i) (w > x (i) + b) 1; ;:::;m Re-wright the constraints as g i (w) = y (i) (w > x (i) + b)+1 0 so as to match the standard optimization form min w f(w) s.t. g i (w) 0; h i (w) =0; i =1;:::;k i =1;:::;l

Equality Cases g i (w) = y (i) (w > x (i) + b)+1=0 )y (i)³ w > x (i) kwk + b = 1 kwk kwk Geometric margin The g i s = 0 cases correspond to the training examples that have functional margin exactly equal to 1.

Objective of an SVM SVM objective: finding the optimal margin classifier min w;b 1 2 kwk2 s.t. y (i) (w > x (i) + b)+1 0; i =1;:::;m Lagrangian L(w; b; ) = 1 2 kwk2 i [y (i) (w > x (i) + b) 1] No β or equality constraints in SVM problem

Solving L(w; b; ) = 1 2 kwk2 Derivatives @ m @w L(w; b; ) =w X i y (i) x (i) =0 ) w = L(w; b; ) = 1 2 = @ m @b L(w; b; ) = X i y (i) =0 i [y (i) (w > x (i) + b) 1] Then Lagrangian is re-written as i y (i) x (i) i y (i) x (i) 2 X m i [y (i) (w > x (i) + b) 1] i 1 2 i;j=1 y (i) y (j) i j x (i)> x (j) b i y (i) = 0

Solving α * Dual problem max max μ D( ) =max 0 W ( ) = i 1 2 s.t. i 0; ;:::;m i y (i) =0 min 0 w;b i;j=1 L(w; b; ) To solve α * with some methods e.g. SMO We will get back to this solution later y (i) y (j) i j x (i)> x (j)

Solving w * and b * With α * solved, w * is obtained by w = i y (i) x (i) Only supporting vectors with α > 0 With w * solved, b * is obtained by b = max i:y (i) = 1 w > x (i) +min i:y (i) =1 w > x (i) 2 w

Predicting Values With the solutions of w * and b *, the predicting value (i.e. functional margin) of each instance is w > x + b = = ³ X m i y (i) x (i) > x + b i y (i) hx (i) ;xi + b We only need to calculate the inner product of x with the supporting vectors

Non-Separable Cases The derivation of the SVM as presented so far assumes that the data is linearly separable. More practical cases are linearly non-separable.

Dealing with Non-Separable Cases Add slack variables Lagrangian L(w; b;»; ; r) = 1 2 w> w + C Dual problem max W ( ) = min w;b 1 2 kwk2 + C» i L1 regularization s.t. y (i) (w > x (i) + b) 1» i ;;:::;m» i 0; i =1;:::;m» i i [y (i) (x > w + b) 1+» i ] r i» i i 1 2 y (i) y (j) i j x (i)> x (j) i;j=1 s.t. 0 i C; i =1;:::;m i y (i) =0 Surprisingly, this is the only change Efficiently solved by SMO algorithm

SVM Hinge Loss vs. LR Loss SVM Hinge loss 1 2 kwk2 + C max(0; 1 y i (w > x i + b)) LR log loss y i log ¾(w > x i + b) (1 y i )log(1 ¾(w > x i + b)) If y = 1

Coordinate Ascent (Descent) For the optimization problem max W ( 1; 2 ;:::; m ) Coordinate ascent algorithm Loop until convergence: f For i =1;:::;m f i := arg max W ( 1 ;:::; i 1 ; ^ i ; i+1 ;:::; m ) ^ i g g

Coordinate Ascent (Descent) A two-dimensional coordinate ascent example

SMO Algorithm SMO: sequential minimal optimization SVM optimization problem max W ( ) = i 1 2 y (i) y (j) i j x (i)> x (j) i;j=1 s.t. 0 i C; i =1;:::;m i y (i) =0 Cannot directly apply coordinate ascent algorithm because i y (i) =0 ) i y (i) = X j y (j) j6=i

SMO Algorithm Update two variable each time Loop until convergence { 1. Select some pair α i and α j to update next 2. Re-optimize W(α) w.r.t. α i and α j } Convergence test: whether the change of W(α) is smaller than a predefined value (e.g. 0.01) Key advantage of SMO algorithm is the update of α i and α j (step 2) is efficient

SMO Algorithm max W ( ) = i 1 2 i;j=1 s.t. 0 i C; i =1;:::;m i y (i) =0 Without loss of generality, hold α 3 α m and optimize W(α) w.r.t. α 1 and α 2 1 y (1) + 2 y (2) = y (i) y (j) i j x (i)> x (j) i y (i) = ³ i=3 ) 2 = y(1) y (2) 1 + ³ y (2) 1 =(³ 2 y (2) )y (1)

SMO Algorithm With 1 =(³ 2 y (2) )y (1), the objective is written as W ( 1 ; 2 ;:::; m )=W((³ 2 y (2) )y (1) ; 2 ;:::; m ) Thus the original optimization problem max W ( ) = i 1 2 y (i) y (j) i j x (i)> x (j) i;j=1 s.t. 0 i C; i =1;:::;m i y (i) =0 is transformed into a quadratic optimization problem w.r.t. 2 max W ( 2 )=a 2 2 + b 2 + c 2 s.t. 0 2 C

SMO Algorithm Optimizing a quadratic function is much efficient W ( 2 ) W ( 2 ) W ( 2 ) 0 b 2a C 2 0 C 2 0 C 2 max W ( 2 )=a 2 2 + b 2 + c 2 s.t. 0 2 C

Kernel Methods

Non-Separable Cases More practical cases are linearly non-separable. May be solved by slack variables Linearly separable case Cannot be solved by slack variables

Non-Separable Cases More practical cases are linearly non-separable. Solution: mapping feature vectors to a higher-dimensional space An example Á(x) Á(x) =x 2 x x

Non-Separable Cases More generally, mapping feature vectors to a different space Á(x) Figures from Stuart Russell

Feature Mapping Functions SVM only cares about the inner products W ( ) = i 1 2 y (i) y (j) i j x (i)> x (j) i;j=1 With the feature mapping function Á(x) W ( ) = i 1 2 y (i) y (j) i j K(x (i) ;x (j) ) i;j=1 Kernel K(x (i) ;x (j) )=Á(x (i) ) > Á(x (j) )

Kernel With the example feature mapping function 2 3 x Á(x) = 4x 2 5 The corresponding kernel is x 2 x 3 K(x (i) ;x (j) )=Á(x (i) ) > Á(x (j) ) = x (i) x (j) + x (i)2 x (j)2 + x (i)3 x (j)3 [Kernel Trick] For lots of cases, we only need K(x (i) ;x (j) ), thus we can directly define K(x (i) ;x (j) ) without explicitly defining For example, suppose K(x (i) ;x (j) ) Á(x (i) ) x (i) ;x (j) 2 R n K(x (i) ;x (j) )=(x (i)> x (j) ) 2

Kernel Example For example, suppose x (i) ;x (j) 2 R n K(x (i) ;x (j) )=(x (i)> x (j) ) 2 ³ nx = = = k=1 nx x (i) k x(j) k nx k=1 l=1 nx ³ nx l=1 x (i) k x(j) k x(i) l x (j) l (x (i) k x(i) l )(x (j) k x(j) l ) k;l=1 x (i) l x (j) l If n = 3, the mapping function is ) Á(x) = Note that calculating Á(x) takes O(n 2 ) time, while calculating K(x (i) ;x (j) ) only takes O(n) time 2 3 x 1 x 1 x 1 x 2 x 1 x 3 x 2 x 1 x 2 x 2 x 2 x 3 6x 3 x 1 7 4 x 3 x 2 5 x 3 x 3

Kernel for Measuring Similarity Intuitively, for two instances x and z, if Á(x) and are close together, then we expect K(x; z) =Á(x) > Á(z) to be large, and vice versa. Gaussian kernel (a very widely used kernel) kx zk2 K(x; z) =exp μ 2¾ 2 Á(z) Also called radial basis function (RBF) kernel Then what is the feature mapping function for this kernel?

Kernel Matrix Consider a finite set of instances The corresponding Kernel Matrix K is defined as The kernel matrix K must be symmetric since fx (1) ;:::;x (m) g fk ij g i;j=1;:::;m K ij = K(x (i) ;x (j) )=Á(x (i) ) > Á(x (j) )=Á(x (j) ) > Á(x (i) )=K(x (j) ;x (i) )=K ji If we define Á k (x) as the k-th coordinate of the vector Á(x), then for any vector z 2 R m, we have z > Kz = X X z i K ij z j i j = X X z i Á(x (i) ) > Á(x (j) )z j = X X X z i Á k (x (i) )Á k (x (j) )z j i j i j k = X X X z i Á k (x (i) )Á k (x (j) )z j = X ³ X z i Á k (x (i) ) 2 0 k i j k i Therefore, K is semi-definite

Valid (Mercer) Kernel Theorem (Mercer) Let K : R n R n 7! R be given. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any fx (1) ;:::;x (m) g; m < 1, the corresponding kernel matrix is symmetric positive semi-definite. Example valid kernels RBF kernel Simple polynomial kernel kx zk2 K(x; z) =exp μ 2¾ 2 K(x; z) =(x > z) d Cosine similarity kernel K(x; z) = x> z James Mercer UK Mathematician 1883-1932 kxk kzk

Sigmoid Kernel K(x; z) = tanh( x > z + c) tanh(b) = 1 e 2b 1+e 2b Neural networks use sigmoid as activation function SVM with a sigmoid kernel is equivalent to a 2-layer perceptron (We shall return to this after the study of neural networks) Lin, Hsuan-Tien, and Chih-Jen Lin. "A study on sigmoid kernels for SVM and the training of non-psd kernels by SMO-type methods." 2003.

Generalized Linear Models

Review: Linear Regression 2 X = 6 4 x (1) x (2). x (n) 3 Prediction 2 7 5 = 6 4 3 ::: x (1) 3 d 3 ::: x (2) d...... 7 μ. 5 = 3 ::: x (n) d 2 x (1) 3 μ x (2) μ ^y = Xμ = 6 7 4. 5 x (n) μ x (1) 1 x (1) 2 x (1) x (2) 1 x (2) 2 x (2) x (n) 1 x (n) 2 x (n) 2 μ 1 μ 2 6 4. μ d 3 7 y 5 = 2 6 4 y 1 y 2. y n 3 7 5 Objective J(μ) = 1 2 (y ^y)> (y ^y) = 1 2 (y Xμ)> (y Xμ)

Review: Matrix Form of Linear Reg. Objective J(μ) = 1 2 (y Xμ)> (y Xμ) minj(μ) Gradient @J(μ) @μ = X > (y Xμ) Solution @J(μ) @μ = 0! X > (y Xμ) =0! X > y = X > Xμ! ^μ =(X > X) 1 X > y

Generalized Linear Models Dependence y = f(μ > Á(x)) Feature mapping function Mapped feature matrix 2 Á(x (1) 3 2 ) Á 1 (x (1) ) Á 2 (x (1) ) Á h (x (1) 3 ) Á(x (2) ) Á 1 (x (2) ) Á 2 (x (2) ) Á h (x (2) ) =.. Á(x (i) ) =..... Á 1 (x (i) ) Á 2 (x (i) ) Á h (x (i) ) 6 7 6 4. 5 4. 7..... 5 Á(x (n) ) Á 1 (x (n) ) Á 2 (x (n) ) Á h (x (n) )

Matrix Form of Kernel Linear Regression Objective J(μ) = 1 2 (y μ)> (y μ) minj(μ) Gradient Solution @J(μ) @μ @J(μ) = > (y μ) @μ = 0! > (y μ) =0! > y = > μ! ^μ =( > ) 1 > y

Matrix Form of Kernel Linear Regression With the Algebra trick (P 1 + B > R 1 B) 1 B > R 1 = PB > (BPB > + R) 1 The optimal parameters with L2 regularization ^μ =( > + I h ) 1 > y = > ( > + I n ) 1 y for prediction, we never actually need access ^y = ^μ = > ( > + I n ) 1 y = K(K + I n ) 1 y where the kernel matrix K = fk(x (i) ; x (j) )g [http://www.ics.uci.edu/~welling/classnotes/papers_class/kernel-ridge.pdf]