Linear, Binary SVM Classifiers
COMPSCI 37D Machine Learning
Outline

1. What Linear, Binary SVM Classifiers Do
2. Margin
3. Loss and Regularized Risk
4. Training an SVM is a Quadratic Program
5. The KKT Conditions and the Support Vectors
6. The Dual Problem
What Linear, Binary SVM Classifiers Do
The Separable Case

- Where to place the boundary?
- The number of choices grows with d
What Linear, Binary SVM Classifiers Do
SVMs Maximize the Smallest Margin

- Placing the boundary as far as possible from the nearest samples improves generalization
- Leave as much empty space around the boundary as possible
- Only the points that barely make the margin matter: these are the support vectors
- Initially, we don't know which points will be support vectors
What Linear, Binary SVM Classifiers Do
The General Case

- If the data is not linearly separable, there must be misclassified samples; these have a negative margin
- Assign a penalty inversely proportional to the margin, and a penalty that grows affinely with the negative of the margin
- Give different weights to the two penalties (cross-validation!)
- Find the optimal compromise: minimum risk (total penalty)
Margin
Separating Hyperplane

- $X = \mathbb{R}^d$ and $Y = \{-1, +1\}$ (more convenient labels)
- Hyperplane: $n^T x + c = 0$ with $\|n\| = 1$
- Decision rule: $\hat{y} = h(x) = \mathrm{sign}(n^T x + c)$
- $n$ points towards the $\hat{y} = +1$ half-space
- If $y$ is the true label, the decision is correct if $n^T x + c \ge 0$ when $y = +1$, and $n^T x + c \le 0$ when $y = -1$
- More compactly, a decision is correct if $y\,(n^T x + c) \ge 0$
- SVMs want this inequality to hold with a margin
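A minimal sketch of this decision rule in Python with NumPy; the function and variable names are illustrative, not part of the lecture:

    import numpy as np

    def predict(X, n, c):
        # Decision rule yhat = sign(n^T x + c), applied to each row x of X
        return np.sign(X @ n + c)

    def is_correct(X, y, n, c):
        # A decision is correct exactly when y * (n^T x + c) >= 0
        return y * (X @ n + c) >= 0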
Margin
Margin

- The margin of $(x, y)$ is the signed distance of $x$ from the boundary: positive if $x$ is on the correct side of the boundary, negative otherwise:
  $\mu_v(x, y) \stackrel{\text{def}}{=} y\,(n^T x + c)$, where $v = (n, c)$
- Margin of a training set $T$:
  $\mu_v(T) \stackrel{\text{def}}{=} \min_{(x,y) \in T} \mu_v(x, y)$
- The boundary separates $T$ if $\mu_v(T) > 0$

[Figure: separating hyperplane with unit normal $n$; the margins $\mu_v(x, +1)$ and $\mu_v(x, -1)$ are shown as signed distances of samples on either side]
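As a small sketch, the margins of the samples and of the whole training set follow directly from the definitions (NumPy; the names below are assumptions for illustration):

    import numpy as np

    def sample_margins(X, y, n, c):
        # mu_v(x, y) = y * (n^T x + c); a signed distance because ||n|| = 1
        return y * (X @ n + c)

    def training_set_margin(X, y, n, c):
        # mu_v(T) = min over (x, y) in T; the boundary separates T iff this is > 0
        return sample_margins(X, y, n, c).min()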
Loss and Regularized Risk
The Hinge Loss

- Reference margin $\mu^* > 0$ (unknown, to be determined)
- Hinge loss: $\ell(x, y) = \max\{0,\ \mu^* - \mu_v(x, y)\}$
- Training samples with $\mu_v(x, y) \ge \mu^*$ are classified correctly with a margin of at least $\mu^*$
- Some loss is incurred as soon as $\mu_v(x, y) < \mu^*$, even if the sample is classified correctly

[Figure: separating hyperplane with reference margin $\mu^*$; hinge losses $\xi(x, +1)$ and $\xi(x, -1)$ are shown for samples that fall inside the reference margin]
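A hedged sketch of this hinge loss, with mu_star standing in for the reference margin $\mu^*$ (names are illustrative):

    import numpy as np

    def hinge_loss(X, y, n, c, mu_star):
        # max{0, mu* - mu_v(x, y)}: zero only for samples that clear the reference margin,
        # positive as soon as mu_v(x, y) < mu*, even if the sample is classified correctly
        return np.maximum(0.0, mu_star - y * (X @ n + c))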
Loss and Regularized Risk
The Training Risk

- The training risk for SVMs is not just $\frac{1}{N} \sum_{n=1}^N \ell(x_n, y_n)$
- A regularization term is added that penalizes a large $\omega = \|w\|$, i.e., a small margin
- The separating hyperplane is $n^T x + c = 0$; write it as $w^T x + b = 0$ with $w = \omega n$, $b = \omega c$, and $\omega = \|w\| = 1/\mu^*$
- $\omega$ is a reciprocal scale factor: large margin, small $\omega$
- Make the risk higher when $\omega$ is large (small margin):
  $L_T(w, b) \stackrel{\text{def}}{=} \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^N \ell(x_n, y_n, w, b)$
  where the loss, rescaled by $1/\mu^*$, is
  $\ell(x, y) = \frac{1}{\mu^*}\max\{0,\ \mu^* - \mu_v(x, y)\} = \frac{1}{\mu^*}\max\{0,\ \mu^* - y\,(n^T x + c)\} = \max\{0,\ 1 - y\,(w^T x + b)\}$
Loss and Regularized Risk
Regularized Risk

- ERM classifier: $(w^*, b^*) = \arg\min_{(w,b)} L_T(w, b)$
  where $L_T(w, b) \stackrel{\text{def}}{=} \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \ell(x_n, y_n, w, b)$
  and $\ell(x_n, y_n, w, b) \stackrel{\text{def}}{=} \max\{0,\ 1 - y_n(w^T x_n + b)\}$
- $C$ determines a trade-off:
  large $C$ $\Rightarrow$ $\|w\|$ less important $\Rightarrow$ larger $\omega$ $\Rightarrow$ smaller margin $\Rightarrow$ fewer samples within the margin
- We buy a larger margin by accepting more samples inside it
- Cross-validation!
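For concreteness, the regularized training risk in the rescaled variables $(w, b)$ could be computed as follows (a sketch; the names are illustrative, not from the lecture):

    import numpy as np

    def regularized_risk(w, b, X, y, C):
        # L_T(w, b) = 1/2 ||w||^2 + (C/N) * sum_n max{0, 1 - y_n (w^T x_n + b)}
        N = len(y)
        hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
        return 0.5 * w @ w + (C / N) * hinge.sum()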
Training an SVM is a Quadratic Program
Rephrasing Training as a Quadratic Program

- $(w^*, b^*) = \arg\min_{(w,b)} \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \ell(x_n, y_n, w, b)$
  where $\ell(x_n, y_n, w, b) \stackrel{\text{def}}{=} \max\{0,\ 1 - y_n(w^T x_n + b)\}$
- Not differentiable because of the max: bummer!
- Neat trick: instead of minimizing $L_T$ over $w, b$, we introduce new variables $\xi_n$ and minimize $L_T$ over $w, b, \xi_1, \ldots, \xi_N$ with the constraints $\xi_n \ge \ell(x_n, y_n, w, b)$
- We would not decrease $L_T$ if we made $\xi_n$ bigger, so at the minimum $\xi_n = \ell(x_n, y_n, w, b)$
- Advantage: we can now split $\xi_n \ge \ell(x_n, y_n, w, b)$ into two separate, affine constraints, because $\ell$ is a hinge:
  $\xi_n \ge 0$ everywhere, and when $1 - y_n(w^T x_n + b) \ge 0$ we want $\xi_n \ge 1 - y_n(w^T x_n + b)$
- Thus, $\xi_n \ge 0$ and $y_n(w^T x_n + b) - 1 + \xi_n \ge 0$
Training an SVM is a Quadratic Program
Quadratic Program Formulation

- We achieve differentiability at the cost of adding $N$ slack variables $\xi_n$
- Old: $\min_{(w,b)} \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^N \ell(x_n, y_n, w, b)$
  where $\ell(x_n, y_n, w, b) \stackrel{\text{def}}{=} \max\{0,\ 1 - y_n(w^T x_n + b)\}$
- New: $\min_{w, b, \xi} f(w, \xi)$ where $f(w, \xi) = \frac{1}{2}\|w\|^2 + g \sum_{n=1}^N \xi_n$ and $g \stackrel{\text{def}}{=} \frac{C}{N}$,
  subject to the constraints $y_n(w^T x_n + b) - 1 + \xi_n \ge 0$ and $\xi_n \ge 0$
- We have our quadratic program!
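One way to sketch this program numerically is with a generic constrained solver such as scipy.optimize.minimize (SLSQP); a dedicated QP solver would be the usual choice in practice, and all names below are illustrative assumptions:

    import numpy as np
    from scipy.optimize import minimize

    def train_svm_primal(X, y, C):
        # min over (w, b, xi) of 1/2 ||w||^2 + (C/N) sum_n xi_n
        # subject to y_n (w^T x_n + b) - 1 + xi_n >= 0 and xi_n >= 0
        N, d = X.shape
        g = C / N

        def unpack(z):
            return z[:d], z[d], z[d + 1:]          # w, b, xi

        def objective(z):
            w, _, xi = unpack(z)
            return 0.5 * w @ w + g * xi.sum()

        constraints = [
            {"type": "ineq", "fun": lambda z: unpack(z)[2]},  # xi_n >= 0
            {"type": "ineq",
             "fun": lambda z: y * (X @ unpack(z)[0] + unpack(z)[1]) - 1.0 + unpack(z)[2]},
        ]
        z0 = np.zeros(d + 1 + N)
        res = minimize(objective, z0, constraints=constraints, method="SLSQP")
        w, b, xi = unpack(res.x)
        return w, b, xi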
Training an SVM is a Quadratic Program
The Lagrangian

- $\min_{w, b, \xi} f(w, \xi)$ where $f(w, \xi) = \frac{1}{2}\|w\|^2 + g \sum_{n=1}^N \xi_n$ and $g \stackrel{\text{def}}{=} \frac{C}{N}$,
  subject to the constraints $y_n(w^T x_n + b) - 1 + \xi_n \ge 0$ and $\xi_n \ge 0$
- Lagrangian: $\mathcal{L}(u, \lambda) = f(u) - \lambda^T c(u)$
  $\mathcal{L}(w, b, \xi, \alpha, \beta) \stackrel{\text{def}}{=} \frac{1}{2}\|w\|^2 + g \sum_{n=1}^N \xi_n - \sum_{n=1}^N \alpha_n\,[y_n(w^T x_n + b) - 1 + \xi_n] - \sum_{n=1}^N \beta_n\,\xi_n$
- $u$ is $(w, b, \xi)$ and $\lambda$ has both the $\alpha_n$ and the $\beta_n$
The KKT Conditions and the Support Vectors
The KKT Conditions

$\mathcal{L}(w, b, \xi, \alpha, \beta) \stackrel{\text{def}}{=} \frac{1}{2}\|w\|^2 + g \sum_{m=1}^N \xi_m - \sum_{m=1}^N \alpha_m\,[y_m(w^T x_m + b) - 1 + \xi_m] - \sum_{m=1}^N \beta_m\,\xi_m$

- $\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{n=1}^N \alpha_n y_n x_n = 0$
- $\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0$
- $\frac{\partial \mathcal{L}}{\partial \xi_n} = g - \alpha_n - \beta_n = 0$
- $y_n(w^T x_n + b) - 1 + \xi_n \ge 0$
- $\xi_n \ge 0$
- $\alpha_n \ge 0$
- $\beta_n \ge 0$
- $\alpha_n\,[y_n(w^T x_n + b) - 1 + \xi_n] = 0$ (complementarity)
- $\beta_n\,\xi_n = 0$ (complementarity)
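A small numerical check of these conditions, given candidate primal and dual variables, might look as follows (a sketch; all names are assumptions):

    import numpy as np

    def kkt_residuals(w, b, xi, alpha, beta, X, y, C):
        # Each entry should be (near) zero, or nonnegative where noted,
        # when (w, b, xi, alpha, beta) satisfies the KKT conditions
        g = C / len(y)
        slackness = y * (X @ w + b) - 1.0 + xi
        return {
            "dL/dw": w - (alpha * y) @ X,           # w = sum_n alpha_n y_n x_n
            "dL/db": np.sum(alpha * y),             # sum_n alpha_n y_n = 0
            "dL/dxi": g - alpha - beta,             # g - alpha_n - beta_n = 0
            "primal": np.minimum(slackness, 0.0),   # y_n(w^T x_n + b) - 1 + xi_n >= 0
            "comp_alpha": alpha * slackness,        # alpha_n [ ... ] = 0
            "comp_beta": beta * xi,                 # beta_n xi_n = 0
        }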
The KKT Conditions and the Support Vectors
The Support Vectors

- $w = \sum_{n=1}^N \alpha_n y_n x_n$
- The separating-hyperplane parameters $w$ are a linear combination of the training data points $x_n$
- In the separable case, only the data points on the margin boundaries have nonzero $\alpha_n$
- In the non-separable case, it is more complicated, but only data points near the margin have nonzero $\alpha_n$
- Either way, these data points are called the support vectors

[Figure: in the separable case, the support vectors are the points on the two margin boundaries, at distance $\mu^*$ from the separating hyperplane; $\alpha_n \ne 0$ only for these points]
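For instance, scikit-learn's linear SVC (tooling assumed here, not part of the lecture) exposes the support vectors and the products $\alpha_n y_n$ directly; note that sklearn's C is not divided by N, so it matches the C of these slides only up to that scaling, and the toy data below is made up:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print(clf.support_)                 # indices of the support vectors
    print(clf.dual_coef_)               # alpha_n * y_n for the support vectors
    print(clf.coef_, clf.intercept_)    # w = sum_n alpha_n y_n x_n, and b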
The Dual Problem
The Dual

- Some crank-turning: solve the KKT conditions for $w$ and $\xi$, and plug back in
- This yields a very simple dual:
  $D(\alpha) = \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_m \alpha_n y_m y_n\, x_m^T x_n$
  to be maximized with constraints $\sum_{n=1}^N \alpha_n y_n = 0$ and $0 \le \alpha_n \le \frac{C}{N}$
- $D$ turns out to be concave: minimize $-D$ for a convex program
- Key feature: $x_n$ appears only in inner products with other $x_m$
- This will have very important consequences
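A sketch of the dual objective, showing that the data enter only through the Gram matrix of inner products (names are illustrative):

    import numpy as np

    def dual_objective(alpha, X, y):
        # D(alpha) = sum_n alpha_n - 1/2 sum_{m,n} alpha_m alpha_n y_m y_n x_m^T x_n
        K = X @ X.T                          # Gram matrix: K[m, n] = x_m^T x_n
        return alpha.sum() - 0.5 * alpha @ (np.outer(y, y) * K) @ alpha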