Machine Learning A Geometric Approach

Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU)

Linear Separator Ham Spam

From Perceptron to SVM 1959 osenblatt invention +max margin batch online 1962 Novikoff proof +kernels 1964 Vapnik Chervonenkis +soft-margin fall of USSR conservative updates inseparable case 1997 Cortes/Vapnik SVM online approx. max margin 1999 Freund/Schapire voted/avg: revived subgradient descent 2003 Crammer/Singer MIRA minibatch 2007--2010* Singer group Pegasos 2006 Singer group aggressive *mentioned in lectures but optional (others papers all covered in detail) AT&T Research 2002 Collins structured 2005* McDonald/Crammer/Pereira structured MIRA ex-at&t and students 3

Large Margin Classifier hw, xi + b apple 1 hw, xi + b 1 linear function f(x) =hw, xi + b

Why large margins? Maximum o robustness relative o o r + to uncertainty Symmetry breaking Independent of o ρ + + correctly classified instances Easy to find for + easy problems

Feature Map Φ SVM is often used with kernels

Large Margin Classifier hw, xi + b apple 1 hw, xi + b = 1 hw, xi + b =1 hw, xi + b 1 w functional margin: y i (w x i ) geometric margin: y i (w x i ) kwk = 1 kwk

Large Margin Classifier Q1: what if we want functional margin of 2? Q2: what if we want geometric margin of 1? hw, xi + b apple 1 hw, xi + b = 1 hw, xi + b =1 hw, xi + b 1 w SVM objective (max version): max w 1 kwk s.t. 8(x,y) 2 D, y(w x) 1 max. geometric margin s.t. functional margin is at least 1

Large Margin Classifier hw, xi + b apple 1 hw, xi + b = 1 hw, xi + b =1 hw, xi + b 1 w SVM objective (min version): min w kwk s.t. 8(x,y) 2 D, y(w x) 1 interpretation: small models generalize better min. weight vector s.t. functional margin is at least 1

Large Margin Classifier hw, xi + b apple 1 hw, xi + b = 1 hw, xi + b =1 hw, xi + b 1 w SVM objective (min version): min w 1 2 kwk2 s.t. 8(x,y) 2 D, y(w x) 1 w not differentiable, w 2 is. min. weight vector s.t. functional margin is at least 1

SVM vs. MIRA SVM: min weight vector to enforce functional margin of at least 1 on ALL EXAMPLES MIRA: min weight change to enforce functional margin of at least 1 on THIS EXAMPLE MIRA is 1-step or online approximation of SVM Aggressive MIRA SVM as p 1 min w 1 2 kwk2 s.t. 8(x,y) 2 D, y(w x) 1 min kw 0 wk 2 w 0 s.t. w 0 x 1 perceptron w MIRA x i

Convex Hull Interpretation max. distance between convex hulls how many support vectors in 2D? weight vector is determined by the support vectors alone X c.f. perceptron: w = what about MIRA? (x,y)2errors y x why don t use convex hulls for SVMs in practice??

Convexity and Convex Hulls convex combination

Optimization Primal optimization problem 1 min w 2 kwk2 s.t. 8(x,y) 2 D, y(w x) 1 constraint Convex optimization: convex function over convex set! Quadratic prog.: quadratic function w/ linear constraints linear quadratic

MIRA as QP MIRA is a trivial QP; can be solved geometrically what about multiple constraints (e.g. minibatch)? min kw 0 wk 2 w 0 s.t. w 0 x 1 kx i k w i x i 1 kx i k x i MIRA w i kx i k 1 w i x i

Optimization Primal optimization problem 1 min w 2 kwk2 s.t. 8(x,y) 2 D, y(w x) 1 Convex optimization: convex function over convex set! Lagrange function L(w, b, ) = 1 2 kwk2 X Derivatives in w need to vanish @ w L(w, b, αa) =w @ b L(w, b, a) w = X i i X i y i y i x=0 i i [y i [hx i,wi + b] 1] i y i x i =0 constraint model is a linear combo of a small subset of input (the support vectors) i.e., those with α i > 0

Lagrangian & Saddle Point equality: min x 2 s.t. x = 1 inequality: min x 2 s.t. x >= 1 Lagrangian: L(x, α)=x 2 - α(x-1) derivative in x need to vanish optimality is at saddle point with α x min x in primal => max α in dual α

Constrained Optimization minimize w,b 1 Quadratic Programming Quadratic Objective Linear Constraints 2 kwk2 subject to y i [hx i,wi + b] 1 Karush Kuhn Tucker w = X i constraint y i i x i KKT condition (complementary slackness) optimal point is achieved at active constraints where α i > 0 (α i =0 => inactive) i [y i [hw, x i i + b] 1] = 0

KKT => Support Vectors minimize w,b 1 2 kwk2 subject to y i [hx i,wi + b] 1 w = X i y i i x i w Karush Kuhn Tucker (KKT) Optimality Condition i [y i [hw, x i i + b] 1] = 0 i =0 i > 0=) y i [hw, x i i + b] =1

Properties w = X i y i i x i w Weight vector w as weighted linear combination of instances Only points on margin matter (ignore the rest and get same solution) Only inner products matter Quadratic program We can replace the inner product by a kernel Keeps instances away from the margin

Alternative: Primal=>Dual Lagrange function L(w, b, ) = 1 2 kwk2 X i [y i [hx i,wi + b] 1] Derivatives in w need to vanish α @ w L(w, b, a) =w i X i i y i x i =0 X @ b L(w, b, a) w = i y i y i x=0 i Plugging w back into L yields maximize 1 2 X i,j i j y i y j hx i,x j i + X i i dual variables subject to X i y i = 0 and i 0

Primal vs. Dual Primal minimize w,b 1 2 kwk2 subject to y i [hx i,wi + b] 1 w = X i y i i x i w Dual maximize 1 2 X i,j i j y i y j hx i,x j i + X i i subject to X i y i = 0 and i 0 dual variables

Solving the optimization problem Dual problem maximize 1 2 X i,j i j y i y j hx i,x j i + X i i subject to X i i y i = 0 and i 0 dual variables If problem is small enough (1000s of variables) we can use off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO) For larger problem use fact that only SVs matter and solve in blocks (active set method).

Quadratic Program in Dual Dual problem maximize 1 subject to 0 2 T Q T b Q: what s the Q in SVM primal? how about Q in SVM dual? Methods Gradient Descent Coordinate Descent aka., Hildreth Algorithm Sequential Minimal Optimization (SMO) Quadratic Programming Objective: Quadratic function Q is positive semidefinite Constraints: Linear functions

Convex QP if Q is positive (semi) definite, i.e., x T Qx 0 for all x, then convex QP => local min/max is global min/max if Q = 0, it reduces to linear programming if Q is indefinite => saddlepoint CP general QP is NP-hard; convex QP is polynomial-time svm LP QP

QP: Hildreth Algorithm idea 1: update one coordinate while fixing all other coordinates e.g., update coordinate i is to solve: 1 argmax 2 T Q T b i subject to 0 Quadratic function with only one variable Maximum => first-order derivative is 0

QP: Hildreth Algorithm idea 2: choose another coordinate and repeat until meet stopping criterion reach maximum or increase between 2 consecutive iterations is very small or after some # of iterations how to choose coordinate: sweep pattern Sequential: 1, 2,..., n, 1, 2,..., n,... 1, 2,..., n, n-1, n-2,...,1, 2,... Random: permutation of 1,2,..., n Maximal Descent choose i with maximal descent in objective

QP: Hildreth Algorithm initialize i =0 for all i repeat pick i following sweep pattern solve i 1 argmax 2 T Q T b i subject to 0 until meet stopping criterion

QP: Hildreth Algorithm maximize 1 2 T 4 1 1 2 T 6 4 subject to 0 choose coordinates 1, 2, 1, 2,...

QP: Hildreth Algorithm pros: extremely simple no gradient calculation easy to implement cons: converges slow, compared to other methods can t deal with too many constraints works for minibatch MIRA but not SVM

Linear Separator Ham Spam

Large Margin Classifier hw, xi + b apple 1 hw, xi + b 1 + + linear function f(x) =hw, xi + b linear separator is impossible

Large Margin Classifier hw, xi + b apple 1 hw, xi + b 1 + minimum error separator Theorem (Minsky & Papert) is impossible Finding the minimum error separating hyperplane is NP hard

Adding slack variables hw, xi + b apple 1+ hw, xi + b 1 + - minimize amount Convex optimization problem of slack

margin violation vs. misclassification misclassification is also margin violation (ξ>0)

Adding slack variables Hard margin problem minimize w,b 1 2 kwk2 subject to y i [hw, x i i + b] 1 With slack variables w determines ξ 1 minimize w,b 2 kwk2 + C X i ξ i subject to y i [hw, x i i + b] 1 i and i 0 Problem is always feasible. Proof: w = 0 and b = 0 and i =1 C=0? C=+? (also yields upper bound) C=+ => not tolerant on violation => hard margin

Review: Convex Optimization Primal optimization problem minimize x f(x) subjecttoc i (x) apple 0 Lagrange function L(x, ) =f(x)+ X i i c i (x) First order optimality conditions in x @ x L(x, ) =@ x f(x)+ X i i @ x c i (x) =0 Solve for x and plug it back into L maximize (keep explicit constraints) L(x( ), )

Dual Problem Primal optimization problem 1 minimize w,b 2 kwk2 + C X i ξ i subject to y i [hw, x i i + b] 1 i and i 0 Lagrange function L(w, b, ) = 1 2 kwk2 + C X ξ,,η) i i X i [y i [hx i,wi + b]+ i 1] X i i i i optimality in (w, ξ) is at saddle point with α,η Derivatives in (w, ξ) need to vanish

Dual Problem Lagrange function L(w, b, ) = 1 2 kwk2 + C X ξ,,η) i Derivatives in w need to vanish i @ w L(w, b,,, ) =w X i [y i [hx i,wi + b]+ i 1] i X i i y i x i =0 X i i i @ b L(w, b,,, ) = X i i y i =0 @ i L(w, b,,, ) =C i i =0 Plugging terms back into L yields maximize subject to X i 1 2 X i,j i j y i y j hx i,x j i + X i i y i = 0 and i 2 [0,C] dual variables i bound influence

Karush Kuhn Tucker Conditions L(w, b, ) = 1 2 kwk2 + C X ξ,,η) i i X i [y i [hx i,wi + b]+ i 1] X i i i i w = X i y i i x i α=c α=0 0<α<C 0<α<C w α=0 i [y i [hw, x i i + b]+ i 1] = 0 i i =0 0 apple i = C i apple C α=c i =0=) y i [hw, x i i + b] 1 0 < i <C=) y i [hw, x i i + b] =1 i = C =) y i [hw, x i i + b] apple 1 why these are not disjoint?

Support Vectors and Violations all circled and squared examples are support vectors (α>0) they include ξ=0 (hard-margin SVs), ξ>0 (margin violations), and ξ>1 (misclassifications)

Support Vectors and Violations non-support vectors (α=0, ξ=0) (0<α C, ξ 0) support vectors (α=c, ξ>0) margin violations α=c (α=c, ξ>1) misclassifications all circled and squared examples are support vectors (α>0) they include ξ=0 (hard-margin SVs), ξ>0 (margin violations), and ξ>1 (misclassifications)

SVM with sklearn python demo.py 1e10 python demo.py 1e10

SVM with sklearn python demo.py 0.1 python demo.py 0.01

SVM with sklearn + + In [2]: clf = svm.svc(kernel='linear', C=1e10) In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]] In [4]: Y = [1,1,-1,-1] In [5]: clf.fit(x, Y) In [6]: clf.support_ Out[6]: array([3, 1], dtype=int32) In [7]: clf.dual_coef_ Out[7]: array([[-0.5, 0.5]]) In [8]: clf.coef_ Out[8]: array([[ 1., 0.]]) In [9]: clf.intercept_ Out[9]: array([-0.]) In [10]: clf.support_vectors_ Out[10]: array([[-1., -1.], [ 1., -1.]]) In [11]: clf.n_support_ Out[11]: array([1, 1], dtype=int32) In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]] In [3]: Y = [1,1,-1,-1, 1] In [4]: clf = svm.svc(kernel='linear', C=1) In [5]: clf.fit(x, Y) In [6]: clf.support_vectors_ Out[6]: array([[-1., 1. ], [-1., -1. ], [ 1., -1. ], [-0.1, -0.1]]) In [7]: clf.dual_coef_ α=c Out[7]: array([[-0.45, -0.6, 0.05, 1. ]]) In [8]: clf.coef_ Out[8]: array([[ 1.00000000e+00, 1.49011611e-09]]) In [9]: clf.intercept_ Out[9]: array([-0.]) + α=c + +

SVM with sklearn + + In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]] In [3]: Y = [1,1,-1,-1, 1] In [12]: clf = svm.svc(kernel='linear', C=1e10) In [13]: clf.fit(x, Y) In [14]: clf.coef_ Out[14]: array([[ 2.02010102e+00, 1.00999543e-04]]) In [15]: clf.intercept_ Out[15]: array([ 1.02013469]) In [16]: clf.support_vectors_ Out[16]: array([[-1., 1. ], [-1., -1. ], [-0.01, -0.01]]) In [17]: clf.dual_coef_ Out[17]: array([[-1.01000001, -1.03050607, 2.04050608]]) In [2]: clf = svm.svc(kernel='linear', C=1e10) In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]] In [4]: Y = [1,1,-1,-1] In [5]: clf.fit(x, Y) In [6]: clf.support_ Out[6]: array([3, 1], dtype=int32) In [7]: clf.dual_coef_ Out[7]: array([[-0.5, 0.5]]) In [8]: clf.coef_ Out[8]: array([[ 1., 0.]]) In [9]: clf.intercept_ Out[9]: array([-0.]) In [10]: clf.support_vectors_ Out[10]: array([[-1., -1.], [ 1., -1.]]) In [11]: clf.n_support_ Out[11]: array([1, 1], dtype=int32) + + +

SVM with sklearn + + In [2]: clf = svm.svc(kernel='linear', C=1e10) In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]] In [4]: Y = [1,1,-1,-1] In [5]: clf.fit(x, Y) In [6]: clf.support_ Out[6]: array([3, 1], dtype=int32) In [7]: clf.dual_coef_ Out[7]: array([[-0.5, 0.5]]) In [8]: clf.coef_ Out[8]: array([[ 1., 0.]]) In [9]: clf.intercept_ Out[9]: array([-0.]) In [10]: clf.support_vectors_ Out[10]: array([[-1., -1.], [ 1., -1.]]) In [11]: clf.n_support_ Out[11]: array([1, 1], dtype=int32) In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-2,0]] In [3]: Y = [1,1,-1,-1, 1] In [4]: clf = svm.svc(kernel='linear', C=1) what if C=1e10? In [5]: clf.fit(x, Y) In [6]: clf.coef_ Out[6]: array([[ 1., 0.]]) In [7]: clf.support_vectors_ Out[7]: array([[-1., 1.], + [-1., -1.], α=c [ 1., 1.], [ 1., -1.], i =0=) y i [hw, x i i + b] 1 [-2., 0.]]) In [8]: clf.dual_coef_ 0 < i <C=) y i [hw, x i i + b] =1 Out[8]: array([[-1., -1., 0.5, 0.5, 1. ]]) i = C =) y i [hw, x i i + b] apple 1 In [9]: clf.intercept_ Out[9]: array([-0.]) α=c α=c + +

hard vs. soft margins

C=1 51

C=20 52

From Constrained Optimization to Unconstrained Optimization Optimization (back to Primal) Learning an SVM has been formulated as a constrained optimization problem over w and ξ min w R d,ξ i R w 2 + C + NX i ξ i subject to y i ³ w > x i + b 1 ξ i for i =1...N The constraint y i ³ w > x i + b 1 ξ i, can be written more concisely as y i f(x i ) 1 ξ i which, together with ξ i 0, is equivalent to ξ i = max (0, 1 y i f(x i )) w determines ξ Hence the learning problem is equivalent to the unconstrained optimization problem over w min w R w 2 + C d regularization NX i max (0, 1 y i f(x i )) loss function slides 49-56 from Andrew Zisserman (Oxford/DeepMind); with annotations 53

Loss function min w 2 + C w R d NX i max (0, 1 y i f(x i )) loss function w T x + b = 0 Points are in three categories: Support Vector 1. y i f(x i ) > 1 Point is outside margin. No contribution to loss 2. y i f(x i )=1 Point is on margin. No contribution to loss. As in hard margin case. 3. y i f(x i ) < 1 Point violates margin constraint. Contributes to loss (margin violation ξ>0, including misclassification ξ>1) w Support Vector 54

Loss functions very bad (misclassification) not good enough! good SVM uses hinge loss an approximation to the 0-1 loss max (0, 1 y i f(x i )) y i f(x i ) (perceptron uses a shifted hinge-loss touching the origin) 55

+ SVM min w R d C N X i max (0, 1 y i f(x i )) + w 2 convex convex + convex = convex! 56

Gradient (or steepest) descent algorithm for SVM To minimize a cost function C(w) use the iterative update where η is the learning rate. First, rewrite the optimization problem as an average min w C(w) = λ 2 w 2 + 1 N = 1 N NX i w t+1 w t η t w C(w t ) NX i max (0, 1 y i f(x i )) µ λ 2 w 2 +max(0, 1 y i f(x i )) (with λ =2/(NC)uptoanoverallscaleoftheproblem)and f(x) =w > x + b Because the hinge loss is not differentiable, a sub-gradient is computed 57

Sub-gradient for hinge loss L(x i,y i ; w) =max(0, 1 y i f(x i )) f(x i )=w > x i + b L w = y ix i sub-gradients L w =0 y i f(x i ) 58

Sub-gradient descent algorithm for SVM The iterative update is C(w) = 1 N NX i µ λ 2 w 2 + L(x i,y i ; w) w t+1 w t η wt C(w t ) w t η 1 N where η is the learning rate. NX i (λw t + w L(x i,y i ; w t )) batch gradient Then each iteration t involves cycling through the training data with the updates: just like perceptron! perc: 0 w t+1 w t η(λw t y i x i ) if y i f(x i ) < 1 w t ηλw t otherwise In the Pegasos algorithm the learning rate is set at η t = λt 1 online gradient 59

Pegasos Stochastic Gradient Descent Algorithm (SGD) Randomly sample from the training data 10 2 6 10 1 4 2 energy 10 0 0 10-1 -2-4 10-2 0 50 100 150 200 250 300-6 -6-4 -2 0 2 4 6 SGD is online update: gradient on one example (unbiasedly) approximates the gradient on the whole training data slides 49-56 from Andrew Zisserman (Oxford/DeepMind); with annotations 61