Support Vector Machines, Kernel SVM

Size: px

Start display at page:

Download "Support Vector Machines, Kernel SVM"

Lenard Stevenson
5 years ago
Views:

1 Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

2 Outline 1 Administration 2 Review of last lecture 3 SVM Hinge loss (primal formulation) 4 Kernel SVM Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

3 Announcements HW4 due now HW5 will be posted online today Midterm has been graded Average: 64.6/90 Median: 64.5/90 Standard Deviation: 14.8 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

4 Outline 1 Administration 2 Review of last lecture SVMs Geometric interpretation 3 SVM Hinge loss (primal formulation) 4 Kernel SVM Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

5 SVM Intuition: where to put the decision boundary? Consider the following separable training dataset, i.e., we assume there exists a decision boundary that separates the two classes perfectly. There are an infinite number of decision boundaries H : w T φ(x) + b = 0! H H H Which one should we pick? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

6 SVM Intuition: where to put the decision boundary? Consider the following separable training dataset, i.e., we assume there exists a decision boundary that separates the two classes perfectly. There are an infinite number of decision boundaries H : w T φ(x) + b = 0! H H H Which one should we pick? Idea: Find a decision boundary in the middle of the two classes. In other words, we want a decision boundary that: Perfectly classifies the training data Is as far away from every training point as possible Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

7 Distance from a point to decision boundary The unsigned distance from a point φ(x) to decision boundary (hyperplane) H is d H (φ(x)) = wt φ(x) + b w 2 We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

8 Distance from a point to decision boundary The unsigned distance from a point φ(x) to decision boundary (hyperplane) H is d H (φ(x)) = wt φ(x) + b w 2 We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly. Namely, (w T φ(x) + b) and x s label y must have the same sign, so: d H (φ(x)) = y[wt φ(x) + b] w 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

9 Optimizing the Margin Margin Smallest distance between the hyperplane and all training points margin(w, b) = min n y n [w T φ(x n ) + b] w 2 H : w T φ(x)+b =0 w T φ(x)+b w 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

10 Optimizing the Margin Margin Smallest distance between the hyperplane and all training points margin(w, b) = min n y n [w T φ(x n ) + b] w 2 H : w T φ(x)+b =0 w T φ(x)+b w 2 How should we pick (w, b) based on its margin? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

11 Optimizing the Margin Margin Smallest distance between the hyperplane and all training points margin(w, b) = min n y n [w T φ(x n ) + b] w 2 H : w T φ(x)+b =0 w T φ(x)+b w 2 How should we pick (w, b) based on its margin? We want a decision boundary that is as far away from all training points as possible, so we to maximize the margin! max w,b y n [w T φ(x n ) + b] 1 min = max min y n [w T φ(x n ) + b] n w w,b w 2 n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

12 Rescaled Margin We can further constrain the problem by scaling (w, b) such that min n y n [w T φ(x n ) + b] = 1 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

13 Rescaled Margin We can further constrain the problem by scaling (w, b) such that min n y n [w T φ(x n ) + b] = 1 We ve fixed the numerator in the margin(w, b) equation, and we have: margin(w, b) = 1 w 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

14 Rescaled Margin We can further constrain the problem by scaling (w, b) such that min n y n [w T φ(x n ) + b] = 1 We ve fixed the numerator in the margin(w, b) equation, and we have: margin(w, b) = 1 w 2 Hence the points closest to the decision boundary are at distance 1! w T φ(x)+b =1 H : w T φ(x)+b =0 1 w2 w T φ(x)+b = 1 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

15 SVM: max margin formulation for separable data Assuming separable training data, we thus want to solve: max w,b This is equivalent to 1 w 2 such that y n [w T φ(x n ) + b] 1, n min w,b 1 2 w 2 2 s.t. y n [w T φ(x n ) + b] 1, n Given our geometric intuition, SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

16 SVM for non-separable data Constraints in separable setting y n [w T φ(x n ) + b] 1, n Constraints in non-separable setting Idea: modify our constraints to account for non-separability! Specifically, we introduce slack variables ξ n 0: y n [w T φ(x n ) + b] 1 ξ n, n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

17 SVM for non-separable data Constraints in separable setting y n [w T φ(x n ) + b] 1, n Constraints in non-separable setting Idea: modify our constraints to account for non-separability! Specifically, we introduce slack variables ξ n 0: y n [w T φ(x n ) + b] 1 ξ n, n For hard training points, we can increase ξ n until the above inequalities are met What does it mean when ξ n is very large? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

18 Soft-margin SVM formulation We do not want ξ n to grow too large, and we can control their size by incorporating them into our optimization problem: min w,b,ξ 1 2 w C n ξ n s.t. y n [w T φ(x n ) + b] 1 ξ n, ξ n 0, n n C is user-defined regularization hyperparameter that trades off between the two terms in our objective This is a convex quadratic program that can be solved with general purpose or specialized solvers Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

19 What are support vectors? Support vectors are data points where the margin inequality constraint is active (i.e., an equality): 3 types of SVs ξ n = 0. 1 ξ n y n [w T φ(x n ) + b]} = 0 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

20 What are support vectors? Support vectors are data points where the margin inequality constraint is active (i.e., an equality): 3 types of SVs 1 ξ n y n [w T φ(x n ) + b]} = 0 ξ n = 0. This implies y n [w T φ(x n ) + b] = 1. These are points that are on the margin ξ n < 1. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

21 What are support vectors? Support vectors are data points where the margin inequality constraint is active (i.e., an equality): 3 types of SVs 1 ξ n y n [w T φ(x n ) + b]} = 0 ξ n = 0. This implies y n [w T φ(x n ) + b] = 1. These are points that are on the margin ξ n < 1. These are points that can be classified correctly but do not satisfy the large margin constraint, i.e., distance to the hyperplane is less than 1 ξ n > 1. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

22 What are support vectors? Support vectors are data points where the margin inequality constraint is active (i.e., an equality): 3 types of SVs 1 ξ n y n [w T φ(x n ) + b]} = 0 ξ n = 0. This implies y n [w T φ(x n ) + b] = 1. These are points that are on the margin ξ n < 1. These are points that can be classified correctly but do not satisfy the large margin constraint, i.e., distance to the hyperplane is less than 1 ξ n > 1. These are points that are misclassified. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

23 Visualization of how training data points are categorized w T (x)+b =1 H : w T (x)+b =0 n < 1 n > 1 n =0 w T (x)+b = 1 The SVM solution solution is only determined by a subset of the training samples (as we will see later in the lecture) These samples are called support vectors, which are highlighted by the dotted orange lines in the figure Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

24 Outline 1 Administration 2 Review of last lecture 3 SVM Hinge loss (primal formulation) 4 Kernel SVM Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

25 Hinge loss Definition Assume y { 1, 1} and the decision rule is h(x) = sign(f(x)) with f(x) = w T φ(x) + b, Intuition { l hinge (f(x), y) = 0 if yf(x) 1 1 yf(x) otherwise Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

26 Hinge loss Definition Assume y { 1, 1} and the decision rule is h(x) = sign(f(x)) with f(x) = w T φ(x) + b, Intuition { l hinge (f(x), y) = 0 if yf(x) 1 1 yf(x) otherwise No penalty if raw output, f(x), has same sign and is far enough from decision boundary (i.e., if margin is large enough) Otherwise pay a growing penalty, between 0 and 1 if signs match, and greater than one otherwise Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

27 Hinge loss Definition Assume y { 1, 1} and the decision rule is h(x) = sign(f(x)) with f(x) = w T φ(x) + b, Intuition { l hinge (f(x), y) = 0 if yf(x) 1 1 yf(x) otherwise No penalty if raw output, f(x), has same sign and is far enough from decision boundary (i.e., if margin is large enough) Otherwise pay a growing penalty, between 0 and 1 if signs match, and greater than one otherwise Convenient shorthand l hinge (f(x), y) = max(0, 1 yf(x)) = (1 yf(x)) + Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

28 Visualization and Properties Upper-bound for 0/1 loss function (black line) We use hinge loss is a surrogate to 0/1 loss Why? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

29 Visualization and Properties Upper-bound for 0/1 loss function (black line) We use hinge loss is a surrogate to 0/1 loss Why? Hinge loss is convex, and thus easier to work with (though it s not differentiable at kink) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

30 Visualization and Properties Other surrogate losses can be used, e.g., exponential loss for Adaboost (in blue), logistic loss (not shown) for logistic regression Hinge loss less sensitive to outliers than exponential (or logistic) loss Logistic loss has a natural probabilistic interpretation We can greedily optimize exponential loss (Adaboost) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

31 Primal formulation of support vector machines (SVM) Minimizing the total hinge loss on all the training data min w,b max(0, 1 y n [w T φ(x n ) + b]) + λ 2 w 2 2 n Analogous to regularized least squares, as we balance between two terms (the loss and the regularizer). Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

32 Primal formulation of support vector machines (SVM) Minimizing the total hinge loss on all the training data min w,b max(0, 1 y n [w T φ(x n ) + b]) + λ 2 w 2 2 n Analogous to regularized least squares, as we balance between two terms (the loss and the regularizer). Previously, we used geometric arguments to derive: min w,b,ξ 1 2 w C n ξ n s.t. y n [w T φ(x n ) + b] 1 ξ n and ξ n 0, n Do these the yield the same solution? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

33 Recovering our previous SVM formulation Define C = 1/λ: min w,b C n max(0, 1 y n [w T φ(x n ) + b]) w 2 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

34 Recovering our previous SVM formulation Define C = 1/λ: min w,b C n max(0, 1 y n [w T φ(x n ) + b]) w 2 2 Define ξ n max(0, 1 y n f(x n )) min w,b,ξ C n ξ n w 2 2 s.t. max(0, 1 y n [w T φ(x n ) + b]) ξ n, n Since c max(a, b) c a, c b, we recover previous formulation Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

35 Outline 1 Administration 2 Review of last lecture 3 SVM Hinge loss (primal formulation) 4 Kernel SVM Lagrange duality theory SVM Dual Formulation and Kernel SVM SVM Dual Derivation and Support Vectors Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

36 Kernel SVM Roadmap Key concepts we ll cover Brief review of constrained optimization with inequality constraints Primal and Dual problems Strong Duality and KKT conditions Dual SVM problem and Kernel SVM Dual SVM problem and support vectors Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

37 Constrained Optimization Equality Constraints The Lagrangian is defined as follows: min x f(x) s.t. h j (x) = 0, j L(x, β) = f(x) + j β j h j (x) When problem is convex, we can find the optimal solution by Computing partial derivatives of L Setting them to zero Solving the corresponding system of equations Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

38 Constrained Optimization Inequality Constraints This is the primal problem min x f(x) s.t. g i (x) 0, i h i (x) = 0, j Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

39 Constrained Optimization Inequality Constraints min x f(x) s.t. g i (x) 0, i h i (x) = 0, j This is the primal problem with the generalized Lagrangian: L(x, α, β) = f(x) + i α i g i (x) + j β j h j (x) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

40 Constrained Optimization Inequality Constraints min x f(x) s.t. g i (x) 0, i h i (x) = 0, j This is the primal problem with the generalized Lagrangian: L(x, α, β) = f(x) + i α i g i (x) + j β j h j (x) Consider the following function: θ P (x) = max L(x, α, β) α,β,α i 0 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

41 Constrained Optimization Inequality Constraints min x f(x) s.t. g i (x) 0, i h i (x) = 0, j This is the primal problem with the generalized Lagrangian: L(x, α, β) = f(x) + i α i g i (x) + j β j h j (x) Consider the following function: θ P (x) = max L(x, α, β) α,β,α i 0 If x violates a primal constraint, θ P (x) = ; otherwise θ P (x) = f(x) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

42 Constrained Optimization Inequality Constraints min x f(x) s.t. g i (x) 0, i h i (x) = 0, j This is the primal problem with the generalized Lagrangian: L(x, α, β) = f(x) + i α i g i (x) + j β j h j (x) Consider the following function: θ P (x) = max L(x, α, β) α,β,α i 0 If x violates a primal constraint, θ P (x) = ; otherwise θ P (x) = f(x) Thus min x θ P (x) = min x max α,β,αi 0 L(x, α, β) has same solution as Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

43 Constrained Optimization Inequality Constraints min x f(x) s.t. g i (x) 0, i h i (x) = 0, j This is the primal problem with the generalized Lagrangian: L(x, α, β) = f(x) + i α i g i (x) + j β j h j (x) Consider the following function: θ P (x) = max L(x, α, β) α,β,α i 0 If x violates a primal constraint, θ P (x) = ; otherwise θ P (x) = f(x) Thus min x θ P (x) = min x max α,β,αi 0 L(x, α, β) has same solution as primal problem, which we denote as p Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

44 Constrained Optimization Inequality Constraints Primal Problem Dual Problem p = min x θ P (x) = min x max α,β,α i 0 Consider the function: θ D (α, β) = min x L(x, α, β) L(x, α, β) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

45 Constrained Optimization Inequality Constraints Primal Problem Dual Problem p = min x θ P (x) = min x max α,β,α i 0 Consider the function: θ D (α, β) = min x L(x, α, β) d = max α,β,α i 0 θ D(α, β) = L(x, α, β) max min L(x, α, β) α,β,α i 0 x Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

46 Constrained Optimization Inequality Constraints Primal Problem Dual Problem p = min x θ P (x) = min x max α,β,α i 0 Consider the function: θ D (α, β) = min x L(x, α, β) d = max α,β,α i 0 θ D(α, β) = L(x, α, β) max min L(x, α, β) α,β,α i 0 x Primal and dual are the same, except the max and min are exchanged! Relationship between primal and dual? Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

47 Constrained Optimization Inequality Constraints Primal Problem Dual Problem p = min x θ P (x) = min x max α,β,α i 0 Consider the function: θ D (α, β) = min x L(x, α, β) d = max α,β,α i 0 θ D(α, β) = L(x, α, β) max min L(x, α, β) α,β,α i 0 x Primal and dual are the same, except the max and min are exchanged! Relationship between primal and dual? p d (weak duality) min max of any function is always greater than the max min Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

48 Strong Duality When p = d, we can solve the dual problem in lieu of primal problem! Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

49 Strong Duality When p = d, we can solve the dual problem in lieu of primal problem! Sufficient conditions for strong duality: f and g i are convex, h i are affine (i.e., linear with offset) Inequality constraints are strictly feasible, i.e., there exists some x such that g i (x) < 0 for all i Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

50 Strong Duality When p = d, we can solve the dual problem in lieu of primal problem! Sufficient conditions for strong duality: f and g i are convex, h i are affine (i.e., linear with offset) Inequality constraints are strictly feasible, i.e., there exists some x such that g i (x) < 0 for all i These conditions are all satisfied by the SVM optimization problem! Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

51 Strong Duality When p = d, we can solve the dual problem in lieu of primal problem! Sufficient conditions for strong duality: f and g i are convex, h i are affine (i.e., linear with offset) Inequality constraints are strictly feasible, i.e., there exists some x such that g i (x) < 0 for all i These conditions are all satisfied by the SVM optimization problem! When these conditions hold, there must exist x, α, β such that: x is the solution to the primal and α, β is the solution to the dual p = d = L(x, α, β ) x, α, β satisfy the KKT conditions (which in fact are necessary and sufficient) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

52 Recap When working with constrained optimization problems with inequality constraints, we can write down primal and dual problems The dual solution is always a lower bound on the primal solution (weak duality) The duality gap equals 0 under certain conditions (strong duality), and in such cases we can either solve the primal or dual problem Strong duality holds for the SVM problem, and in particular the KKT conditions are necessary and sufficient for the optimal solution See for details Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

53 Dual formulation of SVM Dual is also a convex quadratic programming max α α n 1 y m y n α m α n φ(x m ) T φ(x n ) 2 n m,n s.t. 0 α n C, n α n y n = 0 n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

54 Dual formulation of SVM Dual is also a convex quadratic programming max α α n 1 y m y n α m α n φ(x m ) T φ(x n ) 2 n m,n s.t. 0 α n C, n α n y n = 0 n There are N dual variable α n, one for each constraint in the primal formulation Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

55 Kernel SVM We replace the inner products φ(x m ) T φ(x n ) with a kernel function max α α n 1 y m y n α m α n k(x m, x n ) 2 n m,n s.t. 0 α n C, n α n y n = 0 n We can define a kernel function to work with nonlinear features and learn a nonlinear decision surface Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

56 Recovering solution to the primal formulation Weights w = n y n α n φ(x n ) Linear combination of the input features Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

57 Recovering solution to the primal formulation Weights w = n y n α n φ(x n ) Linear combination of the input features Offset For any C > α n > 0, we have b = [y n w T φ(x n )] Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

58 Recovering solution to the primal formulation Weights w = n y n α n φ(x n ) Linear combination of the input features Offset For any C > α n > 0, we have b = [y n w T φ(x n )] = [y n m y m α m k(x m, x n )] Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

59 Recovering solution to the primal formulation Weights w = n y n α n φ(x n ) Linear combination of the input features Offset For any C > α n > 0, we have b = [y n w T φ(x n )] = [y n m y m α m k(x m, x n )] Prediction on a test point x h(x) = sign(w T φ(x) + b) = sign( n y n α n k(x n, x) + b) At test time it suffices to know the kernel function! Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

60 Derivation of the dual We will next derive the dual formulation for SVMs. Recipe Formulate the generalized Lagrangian function that incorporates the constraints and introduces dual variables Minimize the Lagrangian function over the primal variables Substitute the primal variables for dual variables in the Lagrangian Maximize the Lagrangian with respect to dual variables Recover the solution (for the primal variables) from the dual variables Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

61 A simple example Consider the example of convex quadratic programming min 1 2 x2 s.t. x 0 2x 3 0 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

62 A simple example Consider the example of convex quadratic programming min 1 2 x2 s.t. x 0 2x 3 0 The generalized Lagrangian is (note that we do not have equality constraints) L(x, α) = 1 2 x2 + α 1 ( x) + α 2 (2x 3) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

63 A simple example Consider the example of convex quadratic programming min 1 2 x2 s.t. x 0 2x 3 0 The generalized Lagrangian is (note that we do not have equality constraints) L(x, α) = 1 2 x2 + α 1 ( x) + α 2 (2x 3) = 1 2 x2 + (2α 2 α 1 )x 3α 2 under the constraints that α 1 0 and α 2 0. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

64 A simple example Consider the example of convex quadratic programming min 1 2 x2 s.t. x 0 2x 3 0 The generalized Lagrangian is (note that we do not have equality constraints) L(x, α) = 1 2 x2 + α 1 ( x) + α 2 (2x 3) = 1 2 x2 + (2α 2 α 1 )x 3α 2 under the constraints that α 1 0 and α 2 0. Its dual problem is Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

65 A simple example Consider the example of convex quadratic programming min 1 2 x2 s.t. x 0 2x 3 0 The generalized Lagrangian is (note that we do not have equality constraints) L(x, α) = 1 2 x2 + α 1 ( x) + α 2 (2x 3) = 1 2 x2 + (2α 2 α 1 )x 3α 2 under the constraints that α 1 0 and α 2 0. Its dual problem is max min L(x, α) = max min 1 α 1 0,α 2 0 x α 1 0,α 2 0 x 2 x2 + (2α 2 α 1 )x 3α 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

66 Example (cont d) We now solve min x L(x, α). The optimal x is attained by ( 1 2 x2 + (2α 2 α 1 )x 3α 2 ) x = 0 x = (2α 2 α 1 ) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

67 Example (cont d) We now solve min x L(x, α). The optimal x is attained by ( 1 2 x2 + (2α 2 α 1 )x 3α 2 ) x = 0 x = (2α 2 α 1 ) We next substitute the solution back into the Lagrangian: 1 g(α) = min x 2 x2 + (2α 2 α 1 )x 3α 2 = 1 2 (2α 2 α 1 ) 2 3α 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

68 Example (cont d) We now solve min x L(x, α). The optimal x is attained by ( 1 2 x2 + (2α 2 α 1 )x 3α 2 ) x = 0 x = (2α 2 α 1 ) We next substitute the solution back into the Lagrangian: 1 g(α) = min x 2 x2 + (2α 2 α 1 )x 3α 2 = 1 2 (2α 2 α 1 ) 2 3α 2 Our dual problem can now be simplified: We will solve the dual next. max α 1 0,α (2α 2 α 1 ) 2 3α 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

69 Solving the dual Note that, g(α) = 1 2 (2α 2 α 1 ) 2 3α 2 0 for all α 1 0, α 2 0. Thus, to maximize the function, the optimal solution is Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

70 Solving the dual Note that, g(α) = 1 2 (2α 2 α 1 ) 2 3α 2 0 for all α 1 0, α 2 0. Thus, to maximize the function, the optimal solution is α 1 = 0, α 2 = 0 This brings us back the optimal solution of x x = (2α 2 α 1) = 0 Namely, we have arrived at the same solution as the one we guessed from the primal formulation Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

71 Deriving the dual for SVM Primal SVM min w,b,ξ 1 2 w C n ξ n s.t. y n [w T φ(x n ) + b] 1 ξ n, ξ n 0, n n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

72 Deriving the dual for SVM Primal SVM min w,b,ξ 1 2 w C n ξ n s.t. y n [w T φ(x n ) + b] 1 ξ n, ξ n 0, n n Lagrangian L(w, b, {ξ n }, {α n }, {λ n }) = C n ξ n w 2 2 n λ n ξ n + n α n {1 y n [w T φ(x n ) + b] ξ n } under the constraints that α n 0 and λ n 0. Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

73 Minimizing the Lagrangian Taking derivatives with respect to the primal variables L w = w n L b = n α n y n = 0 y n α n φ(x n ) = 0 L ξ n = C λ n α n = 0 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

74 Minimizing the Lagrangian Taking derivatives with respect to the primal variables L w = w n L b = n α n y n = 0 y n α n φ(x n ) = 0 L ξ n = C λ n α n = 0 These equations link the primal variables and the dual variables and provide new constraints on the dual variables: w = n α n y n = 0 n C λ n α n = 0 y n α n φ(x n ) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

75 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = n (C α n λ n )ξ n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

76 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = n (C α n λ n )ξ n n y n α n φ(x n ) 2 2 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

77 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = n (C α n λ n )ξ n n y n α n φ(x n ) n α n ( ) α n y n b n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

78 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = (C α n λ n )ξ n y n α n φ(x n ) α n n n n ( ) α n y n b ( T α n y n y m α m φ(x m )) φ(x n ) n m n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

79 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = (C α n λ n )ξ n y n α n φ(x n ) α n n n n ( ) α n y n b ( T α n y n y m α m φ(x m )) φ(x n ) n m n = α n y n α n φ(x n ) 2 2 α n α m y m y n φ(x m ) T φ(x n ) n n m,n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

80 Rearrange the Lagrangian and incorporate these constraints Recall: L( ) = C n ξn w 2 2 n λnξn + n αn{1 yn[wt φ(x n) + b] ξ n} where α n 0 and λ n 0 Constraints from partial derivatives: n αnyn = 0 and C λn αn = 0 g({α n },{λ n }) = L(w, b, {ξ n }, {α n }, {λ n }) = (C α n λ n )ξ n y n α n φ(x n ) α n n n n ( ) α n y n b ( T α n y n y m α m φ(x m )) φ(x n ) n m n = α n y n α n φ(x n ) 2 2 α n α m y m y n φ(x m ) T φ(x n ) n n m,n = n α n 1 α n α m y m y n φ(x m ) T φ(x n ) 2 m,n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

81 The dual problem Maximizing the dual under the constraints max α g({α n }, {λ n }) = n s.t. α n 0, n α n y n = 0 n C λ n α n = 0, λ n 0, n α n 1 y m y n α m α n k(x m, x n ) 2 n m,n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

82 The dual problem Maximizing the dual under the constraints max α g({α n }, {λ n }) = n s.t. α n 0, n α n y n = 0 n C λ n α n = 0, λ n 0, n α n 1 y m y n α m α n k(x m, x n ) 2 n We can simplify as the objective function does not depend on λ n. Specifically, we can combine the constraints involving λ n resulting in the following inequality constraint: α n C: m,n C λ n α n = 0, λ n 0 λ n = C α n 0 α n C Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

83 Simplified Dual max α α n 1 y m y n α m α n φ(x m ) T φ(x n ) 2 n m,n s.t. 0 α n C, n α n y n = 0 n Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

84 Recovering solution to the primal formulation We already identified the primal variable: L w w = n α ny n φ(x n ) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

85 Recovering solution to the primal formulation We already identified the primal variable: L w w = n α ny n φ(x n ) Prediction only depends on support vectors, i.e., points with α n > 0! When does α n > 0? KKT conditions tell us: (1) λ n ξ n = 0 (2) α n {1 ξ n y n [w T φ(x n ) + b]} = 0 Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

86 Recovering solution to the primal formulation We already identified the primal variable: L w w = n α ny n φ(x n ) Prediction only depends on support vectors, i.e., points with α n > 0! When does α n > 0? KKT conditions tell us: (1) λ n ξ n = 0 (2) α n {1 ξ n y n [w T φ(x n ) + b]} = 0 (2) tells us that α n > 0 iff 1 ξ n = y n [w T φ(x n ) + b] If ξ n = 0, then support vector is on the margin Otherwise, ξ n > 0 means that the point is an outlier Equality from derivative of Lagrangian yields: (3) C α n λ n = 0 If ξn > 0, then (1) and (3) imply that α n = C Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

87 Expressions for offset (b) and test predictions Recovering b KKT conditions: (1) λ n ξ n = 0 (2) α n {1 ξ n y n [w T φ(x n ) + b]} = 0 Equality from derivative of Lagrangian: (3) C α n λ n = 0 Combine (1) and (3) and assume α n < C: Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

88 Expressions for offset (b) and test predictions Recovering b KKT conditions: (1) λ n ξ n = 0 (2) α n {1 ξ n y n [w T φ(x n ) + b]} = 0 Equality from derivative of Lagrangian: (3) C α n λ n = 0 Combine (1) and (3) and assume α n < C: λ n = C α n > 0 ξ n = 0 Using (2), if 0 < α n < C and y n { 1, 1}: 1 y n [w T φ(x n ) + b] = 0 b = y n w T φ(x n ) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

89 Expressions for offset (b) and test predictions Recovering b KKT conditions: (1) λ n ξ n = 0 (2) α n {1 ξ n y n [w T φ(x n ) + b]} = 0 Equality from derivative of Lagrangian: (3) C α n λ n = 0 Combine (1) and (3) and assume α n < C: λ n = C α n > 0 ξ n = 0 Using (2), if 0 < α n < C and y n { 1, 1}: 1 y n [w T φ(x n ) + b] = 0 b = y n w T φ(x n ) Test Prediction: h(x) = sign( n y nα n k(x n, x) + b) Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, / 40

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17