Support Vector Machines and Kernel Methods
1 2018 CS420 Machine Learning, Lecture 3. Handout from Prof. Andrew Ng. Support Vector Machines and Kernel Methods. Weinan Zhang, Shanghai Jiao Tong University
2 Linear Classification. For linearly separable cases, we have multiple possible decision boundaries.
3 Linear Classification. For linearly separable cases, we have multiple possible decision boundaries. We can rule out some separators by considering data noise.
4 Linear Classification. For linearly separable cases, we have multiple possible decision boundaries. The intuitive optimal decision boundary is the one with the largest margin.
5 Review: Logistic Regression. Logistic regression is a binary classification model:
$p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}$
Cross-entropy loss function: $L(y, x; p_\theta) = -y\log\sigma(\theta^\top x) - (1-y)\log(1-\sigma(\theta^\top x))$
Gradient (using $\sigma'(z) = \sigma(z)(1-\sigma(z))$ with $z = \theta^\top x$):
$\frac{\partial L(y, x; p_\theta)}{\partial \theta} = -y\frac{1}{\sigma(\theta^\top x)}\sigma(z)(1-\sigma(z))x + (1-y)\frac{1}{1-\sigma(\theta^\top x)}\sigma(z)(1-\sigma(z))x = (\sigma(\theta^\top x) - y)x$
Update rule: $\theta \leftarrow \theta + \eta\,(y - \sigma(\theta^\top x))\,x$
6 Label Decision. Logistic regression provides the probability
$p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}$
The final label of an instance is decided by setting a threshold $h$:
$\hat y = \begin{cases} 1, & p_\theta(y=1|x) > h \\ 0, & \text{otherwise} \end{cases}$
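A minimal sketch of this threshold decision (not from the lecture; the helper names and numbers are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(theta, x, h=0.5):
    p = sigmoid(theta @ x)          # p_theta(y=1|x)
    return 1 if p > h else 0

theta = np.array([2.0, -1.0])
print(predict_label(theta, np.array([1.0, 0.5])))  # sigmoid(1.5) ~ 0.82 > 0.5 -> 1
```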
7 Logistic Regression Scores. The score $s(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ gives $p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}$.
Scores of example points: $s(x^{(A)}) = \theta_0 + \theta_1 x_1^{(A)} + \theta_2 x_2^{(A)}$, and similarly for $x^{(B)}$ and $x^{(C)}$.
Decision boundary: $s(x) = 0$. The higher the score, the larger the distance to the decision boundary and the higher the confidence. (Figure in the $(x_1, x_2)$ plane; example from Andrew Ng.)
8 Linear Classification The intuitive optimal decision boundary: the highest confidence
9 Notations for SVMs. Feature vector $x$; class label $y \in \{-1, 1\}$; parameters: intercept $b$ and feature weight vector $w$. Label prediction:
$h_{w,b}(x) = g(w^\top x + b), \quad g(z) = \begin{cases} +1, & z \ge 0 \\ -1, & \text{otherwise} \end{cases}$
10 Logistic Regression Scores. In SVM notation, $s(x) = b + w_1 x_1 + w_2 x_2$ and $p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}$.
Scores of example points: $s(x^{(A)}) = b + w_1 x_1^{(A)} + w_2 x_2^{(A)}$, and similarly for $x^{(B)}$ and $x^{(C)}$.
Decision boundary: $s(x) = 0$. The higher the score, the larger the distance to the separating hyperplane and the higher the confidence. (Figure in the $(x_1, x_2)$ plane; example from Andrew Ng.)
11 Margins. Functional margin: $\hat\gamma^{(i)} = y^{(i)}(w^\top x^{(i)} + b)$
Note that the separating hyperplane won't change with the magnitude of $(w, b)$: $g(w^\top x + b) = g(2w^\top x + 2b)$
Geometric margin: $\gamma^{(i)} = y^{(i)}(w^\top x^{(i)} + b)$ where $\|w\|_2 = 1$
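A small numeric sketch (illustrative values of my own) of how the functional margin scales with $(w, b)$ while the geometric margin does not:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -1.0
x, y = np.array([1.0, 1.0]), 1
functional = y * (w @ x + b)                  # scales with the magnitude of (w, b)
geometric = functional / np.linalg.norm(w)    # invariant to rescaling (w, b)
print(functional, geometric)                  # 6.0 and 1.2
print(2 * functional, 2 * functional / np.linalg.norm(2 * w))  # 12.0 and still 1.2
```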
12 Margins. The point $x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|}$ lies on the decision boundary:
$w^\top\left(x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|}\right) + b = 0 \;\Rightarrow\; \gamma^{(i)} = y^{(i)}\,\frac{w^\top x^{(i)} + b}{\|w\|} = y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|}\right)$
Given a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, the smallest geometric margin is $\gamma = \min_{i=1,\dots,m} \gamma^{(i)}$
13 Objective of an SVM. Find a separating hyperplane that maximizes the minimum geometric margin:
$\max_{\gamma, w, b}\; \gamma \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge \gamma,\; i = 1, \dots, m;\; \|w\| = 1$ (non-convex constraint)
Equivalent to maximizing the normalized functional margin:
$\max_{\hat\gamma, w, b}\; \frac{\hat\gamma}{\|w\|} \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge \hat\gamma,\; i = 1, \dots, m$ (non-convex objective)
14 Objective of an SVM. The functional margin scales with $(w, b)$ without changing the decision boundary, so let's fix the functional margin at $\hat\gamma = 1$. The objective is written as
$\max_{w, b}\; \frac{1}{\|w\|} \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1, \dots, m$
which is equivalent to
$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1, \dots, m$
This optimization problem can be efficiently solved by quadratic programming.
15 A Digression on Lagrange Duality in Convex Optimization. Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
16 Lagrangian for Convex Optimization. A convex optimization problem:
$\min_w f(w) \quad \text{s.t.}\; h_i(w) = 0,\; i = 1, \dots, l$
The Lagrangian of this problem is defined as
$L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$, where the $\beta_i$ are the Lagrange multipliers.
Solving $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial \beta_i} = 0$ yields the solution of the original optimization problem.
17 Lagrangian for Convex Optimization. (Figure: level sets $f(w) = 3, 4, 5$ in the $(w_1, w_2)$ plane and the constraint curve $h(w) = 0$.) At the optimum,
$\frac{\partial L(w, \beta)}{\partial w} = \frac{\partial f(w)}{\partial w} + \beta\frac{\partial h(w)}{\partial w} = 0$,
i.e., the two gradients point in the same direction.
18 With Inequality Constraints. A convex optimization problem:
$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i = 1, \dots, k;\quad h_i(w) = 0,\; i = 1, \dots, l$
The Lagrangian of this problem is defined as
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$, where the $\alpha_i, \beta_i$ are the Lagrange multipliers.
19 Primal Problem. A convex optimization problem:
$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i = 1, \dots, k;\quad h_i(w) = 0,\; i = 1, \dots, l$
The Lagrangian: $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The primal problem: $\theta_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
If a given $w$ violates any constraint, i.e., $g_i(w) > 0$ or $h_i(w) \ne 0$, then $\theta_P(w) = +\infty$.
20 Primal Problem. Conversely, if all constraints are satisfied for $w$, then $\theta_P(w) = f(w)$:
$\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ +\infty & \text{otherwise} \end{cases}$
21 Primal Problem. $\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ +\infty & \text{otherwise} \end{cases}$
The minimization problem $\min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$ is the same as the original problem
$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i = 1, \dots, k;\quad h_i(w) = 0,\; i = 1, \dots, l$
Define the value of the primal problem: $p^* = \min_w \theta_P(w)$
22 Dual Problem. A slightly different problem. Define the dual optimization problem
$d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
with min and max exchanged compared to the primal problem $\min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$. Here $d^*$ is the value of the dual problem.
23 Primal Problem vs. Dual Problem.
$d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
Proof: for all $w$ and all $\alpha \ge 0, \beta$,
$\min_{w'} L(w', \alpha, \beta) \le L(w, \alpha, \beta) \le \max_{\alpha', \beta':\, \alpha' \ge 0} L(w, \alpha', \beta')$
Taking the max over $(\alpha, \beta)$ on the left and the min over $w$ on the right gives $d^* \le p^*$.
But under certain conditions, $d^* = p^*$.
24 Karush-Kuhn-Tucker (KKT) Conditions. If $f$ and the $g_i$'s are convex, the $h_i$'s are affine, and the $g_i$'s are all strictly feasible, then there must exist $w^*, \alpha^*, \beta^*$ such that $w^*$ is the solution of the primal problem, $(\alpha^*, \beta^*)$ are the solutions of the dual problem, and the values of the two problems are equal. Moreover, $w^*, \alpha^*, \beta^*$ satisfy the KKT conditions:
$\frac{\partial}{\partial w} L(w^*, \alpha^*, \beta^*) = 0, \quad \frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0,\; i = 1, \dots, l$
$\alpha_i^* g_i(w^*) = 0,\; i = 1, \dots, k$ (KKT dual complementarity condition)
$g_i(w^*) \le 0$ and $\alpha_i^* \ge 0,\; i = 1, \dots, k$
Moreover, if some $w^*, \alpha^*, \beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and dual problems. For more details, refer to Boyd and Vandenberghe, Convex Optimization, 2004.
25 Now Back to SVM Problem
26 Objective of an SVM. SVM objective: finding the optimal margin classifier
$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1, \dots, m$
Rewrite the constraints as $g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0$ so as to match the standard optimization form
$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i = 1, \dots, k;\quad h_i(w) = 0,\; i = 1, \dots, l$
27 Equality Cases.
$g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 = 0 \;\Rightarrow\; y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|}\right) = \frac{1}{\|w\|}$ (geometric margin)
The $g_i = 0$ cases correspond to the training examples that have functional margin exactly equal to 1.
28 Objective of an SVM. SVM objective: finding the optimal margin classifier
$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0,\; i = 1, \dots, m$
Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\left[y^{(i)}(w^\top x^{(i)} + b) - 1\right]$
There are no $\beta$'s or equality constraints in the SVM problem.
29 Solving. $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\left[y^{(i)}(w^\top x^{(i)} + b) - 1\right]$
$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Then the Lagrangian is rewritten as
$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)} - b\sum_{i=1}^{m} \alpha_i y^{(i)}$
where the last term is zero.
30 Solving α*. Dual problem: $\max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{w, b} L(w, b, \alpha)$, i.e.
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
$\text{s.t.}\; \alpha_i \ge 0,\; i = 1, \dots, m;\quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Solve for α* with some method, e.g. SMO. We will get back to this solution later.
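The lecture solves this dual with SMO below; as a point of comparison, here is a minimal sketch of handing the same dual to a generic QP solver (cvxopt is assumed installed; the helper name is mine):

```python
# Hard-margin SVM dual via a generic QP solver. cvxopt solves
#   min 1/2 a^T P a + q^T a   s.t.  G a <= h,  A a = b,
# so we negate the dual objective and encode a_i >= 0 and sum_i a_i y_i = 0.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_alphas(X, y):
    """X: (m, n) features, y: (m,) labels in {-1, +1}; returns alpha of shape (m,)."""
    m = X.shape[0]
    K = X @ X.T                                     # inner products x_i . x_j
    P = matrix(np.outer(y, y) * K)                  # P_ij = y_i y_j x_i . x_j
    q = matrix(-np.ones(m))                         # maximize sum(a) = minimize -sum(a)
    G = matrix(-np.eye(m))                          # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(np.double))  # equality: sum_i a_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```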
31 Solving w* and b*. With α* solved, w* is obtained by
$w^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$ (only the support vectors have $\alpha_i > 0$)
With w* solved, b* is obtained by
$b^* = -\frac{\max_{i:\, y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i:\, y^{(i)} = 1} w^{*\top} x^{(i)}}{2}$
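A numpy sketch of recovering $w^*$ and $b^*$ from a dual solution (assuming an `alpha` array like the one returned by the QP sketch above; the tolerance is my choice):

```python
import numpy as np

def recover_w_b(X, y, alpha, tol=1e-6):
    sv = alpha > tol                              # support vectors: alpha_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]               # w* = sum_i alpha_i y_i x_i
    scores = X @ w                                # w*^T x_i for every example
    b = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0
    return w, b, sv
```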
32 Predicting Values. With the solutions of w* and b*, the predicted value (i.e. functional margin) of each instance is
$w^{*\top} x + b^* = \left(\sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}\right)^\top x + b^* = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x\rangle + b^*$
We only need to calculate the inner products of $x$ with the support vectors.
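A matching prediction sketch that touches only the support vectors (again assuming the helpers above):

```python
import numpy as np

def predict(X_train, y_train, alpha, b, x, tol=1e-6):
    sv = alpha > tol                              # keep only alpha_i > 0
    score = (alpha[sv] * y_train[sv]) @ (X_train[sv] @ x) + b
    return 1 if score >= 0 else -1                # sign of w*^T x + b*
```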
33 Non-Separable Cases The derivation of the SVM as presented so far assumes that the data is linearly separable. More practical cases are linearly non-separable.
34 Dealing with Non-Separable Cases. Add slack variables $\xi_i$:
$\min_{w, b}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i$ (L1 regularization of the slacks)
$\text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i,\; i = 1, \dots, m;\quad \xi_i \ge 0,\; i = 1, \dots, m$
Lagrangian: $L(w, b, \xi, \alpha, r) = \frac{1}{2} w^\top w + C\sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i\left[y^{(i)}(x^{(i)\top} w + b) - 1 + \xi_i\right] - \sum_{i=1}^{m} r_i \xi_i$
Dual problem:
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
$\text{s.t.}\; 0 \le \alpha_i \le C,\; i = 1, \dots, m;\quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Surprisingly, this box constraint is the only change. Efficiently solved by the SMO algorithm.
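In practice this soft-margin dual is rarely coded by hand; for instance, scikit-learn's `SVC` (which wraps libsvm's SMO-style solver) exposes exactly this $C$-parameterized problem. A minimal sketch with toy data of my own:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(C=1.0, kernel='linear').fit(X, y)
print(clf.support_)     # indices of the support vectors (0 < alpha_i <= C)
print(clf.dual_coef_)   # y_i * alpha_i for those support vectors
```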
35 SVM Hinge Loss vs. LR Loss.
SVM hinge loss: $\frac{1}{2}\|w\|^2 + C\sum_i \max\left(0,\, 1 - y_i(w^\top x_i + b)\right)$
LR log loss: $-y_i \log\sigma(w^\top x_i + b) - (1 - y_i)\log\left(1 - \sigma(w^\top x_i + b)\right)$
(Figure: the two losses as functions of the score for $y = 1$.)
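A small numeric sketch comparing the two losses as functions of the raw score $s = w^\top x_i + b$ for a positive example (the values are my own):

```python
import numpy as np

s = np.linspace(-3, 3, 7)
hinge = np.maximum(0.0, 1.0 - s)       # hinge loss with y_i = +1 (labels in {-1, +1})
log_loss = np.log1p(np.exp(-s))        # -log sigmoid(s), i.e. log loss with y_i = 1
for si, hl, ll in zip(s, hinge, log_loss):
    print(f"s = {si:+.1f}   hinge = {hl:.3f}   log = {ll:.3f}")
```

Both losses penalize small or negative scores; the hinge loss is exactly zero once the functional margin reaches 1, while the log loss stays positive everywhere.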
36 Coordinate Ascent (Descent). For the optimization problem $\max_\alpha W(\alpha_1, \alpha_2, \dots, \alpha_m)$, the coordinate ascent algorithm is:
Loop until convergence: {
  For $i = 1, \dots, m$: {
    $\alpha_i := \arg\max_{\hat\alpha_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat\alpha_i, \alpha_{i+1}, \dots, \alpha_m)$
  }
}
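A minimal coordinate-ascent sketch on a concave quadratic $W(\alpha) = c^\top\alpha - \frac{1}{2}\alpha^\top Q\alpha$ (a toy example of my own), where each one-dimensional argmax has the closed form $\alpha_i = (c_i - \sum_{j \ne i} Q_{ij}\alpha_j)/Q_{ii}$:

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])                        # positive definite, so W is concave
c = np.array([1.0, 1.0])
a = np.zeros(2)
for _ in range(100):                              # "loop until convergence"
    for i in range(len(a)):                       # one exact 1-D maximization per coordinate
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
print(a)                                          # coordinate ascent iterate
print(np.linalg.solve(Q, c))                      # exact optimum Q^{-1} c for comparison
```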
37 Coordinate Ascent (Descent) A two-dimensional coordinate ascent example
38 SMO Algorithm. SMO: sequential minimal optimization. SVM optimization problem:
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
$\text{s.t.}\; 0 \le \alpha_i \le C,\; i = 1, \dots, m;\quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
We cannot directly apply the coordinate ascent algorithm, because the equality constraint ties each variable to the rest:
$\sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \;\Rightarrow\; \alpha_i y^{(i)} = -\sum_{j \ne i} \alpha_j y^{(j)}$
39 SMO Algorithm. Update two variables each time:
Loop until convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next
  2. Re-optimize $W(\alpha)$ w.r.t. $\alpha_i$ and $\alpha_j$
}
Convergence test: whether the change of $W(\alpha)$ is smaller than a predefined value (e.g. 0.01).
The key advantage of the SMO algorithm is that the update of $\alpha_i$ and $\alpha_j$ (step 2) is efficient.
40 SMO Algorithm.
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
$\text{s.t.}\; 0 \le \alpha_i \le C,\; i = 1, \dots, m;\quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Without loss of generality, hold $\alpha_3, \dots, \alpha_m$ fixed and optimize $W(\alpha)$ w.r.t. $\alpha_1$ and $\alpha_2$:
$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \;\Rightarrow\; \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}$
41 SMO Algorithm. With $\alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}$, the objective is written as
$W(\alpha_1, \alpha_2, \dots, \alpha_m) = W\left((\zeta - \alpha_2 y^{(2)})\, y^{(1)}, \alpha_2, \dots, \alpha_m\right)$
Thus the original optimization problem
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
$\text{s.t.}\; 0 \le \alpha_i \le C,\; i = 1, \dots, m;\quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
is transformed into a quadratic optimization problem w.r.t. $\alpha_2$:
$\max_{\alpha_2} W(\alpha_2) = a\alpha_2^2 + b\alpha_2 + c \quad \text{s.t.}\; 0 \le \alpha_2 \le C$
42 SMO Algorithm. Optimizing a one-dimensional quadratic is much more efficient:
$\max_{\alpha_2} W(\alpha_2) = a\alpha_2^2 + b\alpha_2 + c \quad \text{s.t.}\; 0 \le \alpha_2 \le C$
(Figure: three cases of $W(\alpha_2)$ over $[0, C]$, with the unconstrained maximizer $-\frac{b}{2a}$ lying inside the interval, below 0, or above $C$; in the latter two cases the solution is clipped to the boundary.)
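The clipped one-dimensional update in code (a sketch; the coefficients are illustrative, not derived from data):

```python
def solve_clipped_quadratic(a, b, C):
    """Maximize W(t) = a t^2 + b t + c (with a < 0) over [0, C] by clipping t* = -b/(2a)."""
    assert a < 0, "concave case: the unconstrained maximizer exists"
    t = -b / (2.0 * a)                             # unconstrained maximizer
    return min(max(t, 0.0), C)                     # clip into the feasible box

print(solve_clipped_quadratic(-1.0, 1.0, 2.0))     # interior solution: 0.5
print(solve_clipped_quadratic(-1.0, -1.0, 2.0))    # clipped to the lower bound 0.0
print(solve_clipped_quadratic(-1.0, 6.0, 2.0))     # clipped to the upper bound C = 2.0
```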
43 Kernel Methods
44 Non-Separable Cases. More practical cases are linearly non-separable. (Figures: a noisy but nearly linearly separable case that slack variables can handle, and a case that cannot be solved by slack variables.)
45 Non-Separable Cases. More practical cases are linearly non-separable. Solution: mapping feature vectors to a higher-dimensional space. An example: the mapping $\phi(x) = x^2$. (Figure: 1-D data $x$ becomes linearly separable after the mapping.)
46 Non-Separable Cases. More generally, mapping feature vectors to a different space via $\phi(x)$. (Figures from Stuart Russell.)
47 Feature Mapping Functions. SVM only cares about the inner products:
$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, x^{(i)\top} x^{(j)}$
With the feature mapping function $\phi(x)$:
$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j\, K(x^{(i)}, x^{(j)})$
Kernel: $K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top \phi(x^{(j)})$
48 Kernel. With the example feature mapping function $\phi(x) = [x,\; x^2,\; x^3]^\top$, the corresponding kernel is
$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top \phi(x^{(j)}) = x^{(i)} x^{(j)} + x^{(i)2} x^{(j)2} + x^{(i)3} x^{(j)3}$
[Kernel trick] In many cases we only need $K(x^{(i)}, x^{(j)})$, so we can directly define $K(x^{(i)}, x^{(j)})$ without explicitly defining $\phi$. For example, suppose $x^{(i)}, x^{(j)} \in \mathbb{R}^n$ and
$K(x^{(i)}, x^{(j)}) = (x^{(i)\top} x^{(j)})^2$
49 Kernel Example. For example, suppose $x^{(i)}, x^{(j)} \in \mathbb{R}^n$ and $K(x^{(i)}, x^{(j)}) = (x^{(i)\top} x^{(j)})^2$. Then
$K(x^{(i)}, x^{(j)}) = \left(\sum_{k=1}^{n} x_k^{(i)} x_k^{(j)}\right)\left(\sum_{l=1}^{n} x_l^{(i)} x_l^{(j)}\right) = \sum_{k=1}^{n}\sum_{l=1}^{n} x_k^{(i)} x_k^{(j)} x_l^{(i)} x_l^{(j)} = \sum_{k,l=1}^{n} (x_k^{(i)} x_l^{(i)})(x_k^{(j)} x_l^{(j)})$
If $n = 3$, the mapping function is
$\phi(x) = [x_1 x_1,\; x_1 x_2,\; x_1 x_3,\; x_2 x_1,\; x_2 x_2,\; x_2 x_3,\; x_3 x_1,\; x_3 x_2,\; x_3 x_3]^\top$
Note that calculating $\phi(x)$ takes $O(n^2)$ time, while calculating $K(x^{(i)}, x^{(j)})$ takes only $O(n)$ time.
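A quick numeric check of this kernel trick (example values are my own): the explicit $O(n^2)$ feature map gives the same inner product as the $O(n)$ kernel evaluation.

```python
import numpy as np

def phi(x):
    return np.outer(x, x).ravel()                 # all n^2 pairwise products x_k x_l

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(z))                            # explicit feature-space inner product
print((x @ z) ** 2)                               # kernel trick: identical value (20.25)
```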
50 Kernel for Measuring Similarity. Intuitively, for two instances $x$ and $z$, if $\phi(x)$ and $\phi(z)$ are close together, then we expect $K(x, z) = \phi(x)^\top \phi(z)$ to be large, and vice versa.
Gaussian kernel (a very widely used kernel): $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
Also called the radial basis function (RBF) kernel. Then what is the feature mapping function for this kernel?
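A minimal RBF kernel sketch: even though its implicit feature map is infinite-dimensional, evaluating $K(x, z)$ itself is cheap (example values are mine):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.1, 0.0])))        # close points: K near 1
print(rbf_kernel(x, np.array([5.0, 5.0])))        # distant points: K near 0
```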
51 Kernel Matrix. Consider a finite set of instances $\{x^{(1)}, \dots, x^{(m)}\}$. The corresponding kernel matrix $K = \{K_{ij}\}_{i,j=1,\dots,m}$ is defined by $K_{ij} = K(x^{(i)}, x^{(j)})$.
The kernel matrix must be symmetric, since $K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top \phi(x^{(j)}) = \phi(x^{(j)})^\top \phi(x^{(i)}) = K(x^{(j)}, x^{(i)}) = K_{ji}$.
If we define $\phi_k(x)$ as the $k$-th coordinate of the vector $\phi(x)$, then for any vector $z \in \mathbb{R}^m$ we have
$z^\top K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i\, \phi(x^{(i)})^\top \phi(x^{(j)})\, z_j = \sum_k \sum_i \sum_j z_i\, \phi_k(x^{(i)})\, \phi_k(x^{(j)})\, z_j = \sum_k \left(\sum_i z_i\, \phi_k(x^{(i)})\right)^2 \ge 0$
Therefore, $K$ is positive semi-definite.
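An empirical sanity check of this property (a sketch reusing the `rbf_kernel` helper defined above, with random data of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(np.allclose(K, K.T))                         # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)       # all eigenvalues >= 0 (up to rounding)
```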
52 Valid (Mercer) Kernel. Theorem (Mercer). Let $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \dots, x^{(m)}\}$, $m < \infty$, the corresponding kernel matrix is symmetric positive semi-definite.
Example valid kernels:
RBF kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
Simple polynomial kernel: $K(x, z) = (x^\top z)^d$
Cosine similarity kernel: $K(x, z) = \frac{x^\top z}{\|x\|\,\|z\|}$
(James Mercer, UK mathematician.)
53 Sigmoid Kernel. $K(x, z) = \tanh(\beta\, x^\top z + c)$, where $\tanh(b) = \frac{1 - e^{-2b}}{1 + e^{-2b}}$
Neural networks use the sigmoid as an activation function; an SVM with a sigmoid kernel is equivalent to a 2-layer perceptron. (We shall return to this after the study of neural networks.)
Lin, Hsuan-Tien, and Chih-Jen Lin. "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods." 2003.
54 Generalized Linear Models
55 Review: Linear Regression.
$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(n)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_d^{(2)} \\ \vdots & & & & \vdots \\ x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$
Prediction: $\hat y = X\theta = \begin{bmatrix} x^{(1)\top}\theta \\ x^{(2)\top}\theta \\ \vdots \\ x^{(n)\top}\theta \end{bmatrix}$
Objective: $J(\theta) = \frac{1}{2}(y - \hat y)^\top (y - \hat y) = \frac{1}{2}(y - X\theta)^\top (y - X\theta)$
56 Review: Matrix Form of Linear Reg. Objective: $J(\theta) = \frac{1}{2}(y - X\theta)^\top (y - X\theta)$
$\frac{\partial J(\theta)}{\partial \theta} = -X^\top(y - X\theta) = 0 \;\Rightarrow\; X^\top(y - X\theta) = 0 \;\Rightarrow\; X^\top y = X^\top X\theta \;\Rightarrow\; \hat\theta = (X^\top X)^{-1} X^\top y$
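A numpy sketch of this closed form (synthetic data of my own; solving the normal equations is preferable to forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=50)    # noisy linear targets
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # solves X^T X theta = X^T y
print(theta_hat)                                   # close to theta_true
```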
57 Generalized Linear Models. Dependence: $y = f(\theta^\top \phi(x))$, where $\phi$ is the feature mapping function.
Mapped feature matrix:
$\Phi = \begin{bmatrix} \phi(x^{(1)})^\top \\ \phi(x^{(2)})^\top \\ \vdots \\ \phi(x^{(n)})^\top \end{bmatrix} = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_h(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_h(x^{(2)}) \\ \vdots & & & \vdots \\ \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \cdots & \phi_h(x^{(n)}) \end{bmatrix}$
58 Matrix Form of Kernel Linear Regression. Objective: $J(\theta) = \frac{1}{2}(y - \Phi\theta)^\top (y - \Phi\theta)$
$\min_\theta J(\theta): \quad \frac{\partial J(\theta)}{\partial \theta} = -\Phi^\top(y - \Phi\theta) = 0 \;\Rightarrow\; \Phi^\top y = \Phi^\top\Phi\theta \;\Rightarrow\; \hat\theta = (\Phi^\top\Phi)^{-1}\Phi^\top y$
59 Matrix Form of Kernel Linear Regression. With the algebra trick
$(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}$
the optimal parameters with L2 regularization are
$\hat\theta = (\Phi^\top\Phi + \lambda I_h)^{-1}\Phi^\top y = \Phi^\top(\Phi\Phi^\top + \lambda I_n)^{-1} y$
For prediction, we never actually need to access $\phi$ explicitly:
$\hat y = \Phi\hat\theta = \Phi\Phi^\top(\Phi\Phi^\top + \lambda I_n)^{-1} y = K(K + \lambda I_n)^{-1} y$
where the kernel matrix $K = \{K(x^{(i)}, x^{(j)})\}$.
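A kernel ridge regression sketch putting this together (RBF kernel, with toy data and hyperparameters of my own): fit and predict entirely through Gram matrices, never forming $\phi$ or $\hat\theta$ explicitly.

```python
import numpy as np

def krr_fit_predict(X, y, X_test, lam=0.1, sigma=1.0):
    def gram(A, B):                                # RBF kernel matrix between two sets
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    K = gram(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam I_n)^{-1} y
    return gram(X_test, X) @ alpha                          # y_hat at the test points

X = np.linspace(0.0, 6.0, 30)[:, None]
y = np.sin(X).ravel()
print(krr_fit_predict(X, y, np.array([[1.5]])))    # approximately sin(1.5) ~ 0.997
```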
More information