Support vector machines

In a nutshell
- Linear classifiers selecting the hyperplane that maximizes the separation margin between the classes (large margin classifiers)
- The solution depends only on a small subset of the training examples (the support vectors)
- Sound generalization theory (bounds on error based on the margin)
- Easily extended to nonlinear separation (kernel machines)
Maximum margin classifier

Classifier margin
Given a training set D, the confidence margin of a classifier is:
$$\rho = \min_{(x,y) \in D} y f(x)$$
It is the minimal confidence margin (for predicting the true label) among the training examples.
The geometric margin of a classifier is:
$$\frac{\rho}{\|w\|} = \min_{(x,y) \in D} \frac{y f(x)}{\|w\|}$$

Canonical hyperplane
There is an infinite number of equivalent formulations for the same hyperplane:
$$w^T x + w_0 = 0$$
$$\alpha (w^T x + w_0) = 0 \quad \forall \alpha \neq 0$$
The canonical hyperplane is the hyperplane with confidence margin equal to 1:
$$\rho = \min_{(x,y) \in D} y f(x) = 1$$
Its geometric margin is:
$$\frac{\rho}{\|w\|} = \frac{1}{\|w\|}$$
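To make these definitions concrete, here is a minimal sketch (the toy points and the initial weights are hypothetical choices of mine) that computes the confidence margin, the geometric margin, and the rescaling that puts a separating hyperplane into canonical form:

```python
import numpy as np

# Toy linearly separable data (hypothetical), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Some separating hyperplane f(x) = w^T x + w_0 (not necessarily canonical)
w = np.array([1.0, 1.0])
w0 = 0.5

f = X @ w + w0
rho = np.min(y * f)                 # confidence margin: min_(x,y) y f(x)
geom = rho / np.linalg.norm(w)      # geometric margin: rho / ||w||

# Rescale (w, w_0) so that the confidence margin becomes exactly 1 (canonical form)
w_c, w0_c = w / rho, w0 / rho
print(geom, np.min(y * (X @ w_c + w0_c)), 1.0 / np.linalg.norm(w_c))
```

The rescaling step only makes sense when the hyperplane actually separates the data (rho > 0); the last two printed values show that the canonical hyperplane has confidence margin 1 and the same geometric margin as before.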
Margin Error Bound
Consider the set of decision functions $f(x) = \mathrm{sign}(w^T x)$ with $\|w\| \leq \Lambda$ and $\|x\| \leq R$, for some $R, \Lambda > 0$. Moreover, let $\rho > 0$ and let $\nu$ denote the fraction of training examples with margin smaller than $\rho / \|w\|$, referred to as the margin error.
For all distributions P generating the data, with probability at least $1 - \delta$ over the drawing of the $m$ training patterns, and for any $\rho > 0$ and $\delta \in (0, 1)$, the probability that a test pattern drawn from P will be misclassified is bounded from above by
$$\nu + \sqrt{\frac{c}{m}\left(\frac{R^2 \Lambda^2}{\rho^2} \ln^2 m + \ln(1/\delta)\right)}$$
Here, $c$ is a universal constant.

Learning problem
$$\min_{w, w_0} \frac{1}{2}\|w\|^2$$
subject to:
$$y_i (w^T x_i + w_0) \geq 1 \quad \forall (x_i, y_i) \in D$$
Note
- the constraints guarantee that all points are correctly classified (plus the canonical form)
- the minimization corresponds to maximizing the (squared) margin
- it is a quadratic optimization problem (the objective is quadratic, the points satisfying the constraints form a convex set); see the sketch below

Lagrangian
The constraints can be included in the minimization using Lagrange multipliers $\alpha_i \geq 0$ ($m = |D|$):
$$L(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$$
The Lagrangian is minimized wrt $w, w_0$ and maximized wrt $\alpha_i$ (the solution is a saddle point).

Dual formulation
Setting the derivatives wrt the primal variables to zero we get:
$$\frac{\partial}{\partial w_0} L(w, w_0, \alpha) = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0$$
$$\nabla_w L(w, w_0, \alpha) = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i$$
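As a concrete illustration of the learning problem above, here is a minimal sketch (toy data and variable names are my own, hypothetical choices) that solves the hard-margin primal with a generic constrained optimizer from SciPy; a dedicated QP solver would be the standard choice in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):
    w = theta[:d]
    return 0.5 * w @ w              # (1/2) ||w||^2, the bias w_0 is not penalized

# One inequality constraint per example: y_i (w^T x_i + w_0) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:d] + theta[d]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints)
w, w0 = res.x[:d], res.x[d]
print("w =", w, "w0 =", w0, "geometric margin =", 1.0 / np.linalg.norm(w))
```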
Dual formulation
Substituting $w = \sum_{i=1}^m \alpha_i y_i x_i$ into the Lagrangian we get:
$$\frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right) = \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j - w_0 \underbrace{\sum_{i=1}^m \alpha_i y_i}_{=0} + \sum_{i=1}^m \alpha_i$$
$$= \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j = L(\alpha)$$
which is to be maximized wrt the dual variables $\alpha$.

Dual formulation
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to:
$$\alpha_i \geq 0 \quad i = 1, \dots, m$$
$$\sum_{i=1}^m \alpha_i y_i = 0$$
The resulting maximization problem includes the constraints and is still a quadratic optimization problem; see the sketch below.

Note
- The primal formulation has $d + 1$ variables (number of features plus one): $\min_{w, w_0} \frac{1}{2}\|w\|^2$
- The dual formulation has $m$ variables (number of training examples): $\max_{\alpha \in \mathbb{R}^m} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
- One can choose the formulation with fewer variables
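Below is a minimal sketch of solving this dual problem numerically (toy data and the use of a generic SciPy optimizer are my own assumptions; dedicated QP or SMO-style solvers are what SVM libraries actually use). The objective is negated because the solver minimizes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                                  # dot products x_i^T x_j
Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):                         # minimize the negated dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

bounds = [(0.0, None)] * m                              # alpha_i >= 0 (hard margin)
constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0

res = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X                          # recover w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4), "w =", w)
```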
Decision function
Substituting $w = \sum_{i=1}^m \alpha_i y_i x_i$ into the decision function we get:
$$f(x) = w^T x + w_0 = \sum_{i=1}^m \alpha_i y_i x_i^T x + w_0$$
- The decision function is a linear combination of dot products between training points and the test point
- The dot product is a kind of similarity between points
- The weights of the combination are $\alpha_i y_i$: a large $\alpha_i$ implies a large contribution towards class $y_i$ (times the similarity); see the sketch below

Karush-Kuhn-Tucker conditions (KKT)
$$L(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$$
At the saddle point it holds that for all $i$:
$$\alpha_i \left( y_i (w^T x_i + w_0) - 1 \right) = 0$$
Thus, either the example does not contribute to the final $f(x)$:
$$\alpha_i = 0$$
or the example stays on the minimal confidence hyperplane from the decision one:
$$y_i (w^T x_i + w_0) = 1$$
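A minimal sketch of this dual form of the decision function (the names are hypothetical; `alpha` and `w0` are assumed to come from a dual solver such as the one sketched above):

```python
import numpy as np

def decision_function(x, X_train, y_train, alpha, w0):
    # f(x) = sum_i alpha_i y_i <x_i, x> + w_0:
    # a weighted sum of dot-product similarities between x and the training points
    return (alpha * y_train) @ (X_train @ x) + w0

def predict(x, X_train, y_train, alpha, w0):
    return np.sign(decision_function(x, X_train, y_train, alpha, w0))
```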
Support vectors
- Points staying on the minimal confidence hyperplanes are called support vectors
- All other points do not contribute to the final decision function (i.e. they could be removed from the training set)
- SVMs are sparse, i.e. they typically have few support vectors

Decision function bias
The bias $w_0$ can be computed from the KKT conditions. Given an arbitrary support vector $x_i$ (with $\alpha_i > 0$), the KKT conditions imply:
$$y_i (w^T x_i + w_0) = 1$$
$$y_i w^T x_i + y_i w_0 = 1$$
$$w_0 = \frac{1 - y_i w^T x_i}{y_i}$$
For robustness, the bias is usually averaged over all support vectors; see the sketch below.
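A minimal sketch of this bias computation (assuming `alpha` comes from a dual solver; the tolerance threshold used to decide $\alpha_i > 0$ is a hypothetical numerical choice):

```python
import numpy as np

def bias_from_kkt(X, y, alpha, tol=1e-6):
    # For any support vector (alpha_i > 0): w_0 = (1 - y_i w^T x_i) / y_i = y_i - w^T x_i
    # (since y_i is +1 or -1). Averaging over all support vectors is more robust.
    w = (alpha * y) @ X
    sv = alpha > tol
    return np.mean(y[sv] - X[sv] @ w)
```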
Soft margin SVM

Slack variables
$$\min_{w \in \mathcal{X},\, w_0 \in \mathbb{R},\, \xi \in \mathbb{R}^m} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$$
subject to:
$$y_i (w^T x_i + w_0) \geq 1 - \xi_i \quad i = 1, \dots, m$$
$$\xi_i \geq 0 \quad i = 1, \dots, m$$
- A slack variable $\xi_i$ represents the penalty for example $x_i$ not satisfying the margin constraint
- The sum of the slacks is minimized together with the inverse margin
- The regularization parameter $C \geq 0$ trades off data fitting and size of the margin (see the sketch below)

Soft margin SVM

Regularization theory
Regularized loss minimization problem:
$$\min_{w \in \mathcal{X},\, w_0 \in \mathbb{R}} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \ell(y_i, f(x_i))$$
- The loss term $\ell(y_i, f(x_i))$ accounts for error minimization
- The margin maximization term accounts for regularization, i.e. solutions with larger margin are preferred

Note
Regularization is a standard approach to prevent overfitting. It corresponds to a prior for simpler (more regular, smoother) solutions.
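As an illustration of the role of C, here is a minimal sketch using scikit-learn's linear SVC (the dataset and the specific C values are arbitrary, hypothetical choices): a small C tolerates more margin violations (larger margin, more regularization), while a large C penalizes them heavily and approaches the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, partially overlapping classes (hypothetical toy data)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.2, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin={1.0 / np.linalg.norm(w):.3f}, "
          f"#support vectors={len(clf.support_)}")
```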
Soft margin SVM

Hinge loss
$$\ell(y_i, f(x_i)) = |1 - y_i f(x_i)|_+ = |1 - y_i (w^T x_i + w_0)|_+$$
- $|z|_+ = z$ if $z > 0$ and $0$ otherwise (positive part)
- it corresponds to the slack variable $\xi_i$ (violation of the margin constraint)
- all examples not violating the margin constraint have zero loss (sparse set of support vectors); see the sketch below

Soft margin SVM

Lagrangian
$$L = C \sum_{i=1}^m \xi_i + \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 + \xi_i \right) - \sum_{i=1}^m \beta_i \xi_i$$
where $\alpha_i \geq 0$ and $\beta_i \geq 0$.
Setting the derivatives wrt the primal variables to zero we get:
$$\frac{\partial L}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0$$
$$\nabla_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \beta_i = 0$$
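A minimal sketch of the hinge loss, showing that it is zero for examples satisfying the margin constraint and equals the optimal slack for those that violate it (data and weights are hypothetical):

```python
import numpy as np

def hinge_loss(X, y, w, w0):
    # |1 - y_i f(x_i)|_+ : zero for examples satisfying the margin constraint,
    # equal to the optimal slack xi_i for those violating it
    return np.maximum(0.0, 1.0 - y * (X @ w + w0))

# Hypothetical toy data and hyperplane
X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -2.0], [-0.3, 0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = np.array([1.0, 1.0]), 0.0

print(hinge_loss(X, y, w, w0))   # nonzero entries are exactly the margin violations
```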
Soft margin SVM

Dual formulation
Substituting into the Lagrangian we get:
$$C \sum_{i=1}^m \xi_i + \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 + \xi_i \right) - \sum_{i=1}^m \beta_i \xi_i = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j = L(\alpha)$$
(the terms $\sum_i \xi_i (C - \alpha_i - \beta_i)$ and $w_0 \sum_i \alpha_i y_i$ both vanish)

Soft margin SVM

Dual formulation
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to:
$$0 \leq \alpha_i \leq C \quad i = 1, \dots, m$$
$$\sum_{i=1}^m \alpha_i y_i = 0$$
The box constraint for $\alpha_i$ comes from $C - \alpha_i - \beta_i = 0$ (and the fact that both $\alpha_i \geq 0$ and $\beta_i \geq 0$); see the sketch below.

Soft margin SVM

Karush-Kuhn-Tucker conditions (KKT)
$$L = C \sum_{i=1}^m \xi_i + \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i (w^T x_i + w_0) - 1 + \xi_i \right) - \sum_{i=1}^m \beta_i \xi_i$$
At the saddle point it holds that for all $i$:
$$\alpha_i \left( y_i (w^T x_i + w_0) - 1 + \xi_i \right) = 0$$
$$\beta_i \xi_i = 0$$
Thus, support vectors ($\alpha_i > 0$) are examples for which $y_i (w^T x_i + w_0) \leq 1$.
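The only change with respect to the hard-margin dual sketch shown earlier is the upper bound C on each $\alpha_i$; a minimal sketch (toy data and solver choice are hypothetical, as before):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with one overlapping point (hypothetical), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.5, 0.5], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
m, C = len(y), 1.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

bounds = [(0.0, C)] * m                                  # box constraint 0 <= alpha_i <= C
constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0

alpha = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=constraints).x
print("alpha =", np.round(alpha, 4))
```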
Soft margin SVM

Support vectors
$$\alpha_i \left( y_i (w^T x_i + w_0) - 1 + \xi_i \right) = 0$$
$$\beta_i \xi_i = 0$$
- If $\alpha_i < C$, then $C - \alpha_i - \beta_i = 0$ and $\beta_i \xi_i = 0$ imply that $\xi_i = 0$. These are called unbound support vectors: $y_i (w^T x_i + w_0) = 1$, i.e. they stay on the confidence-one hyperplane
- If $\alpha_i = C$, then $\xi_i$ can be greater than zero and the support vectors are margin errors (bound support vectors); see the sketch below
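A minimal sketch that separates bound from unbound support vectors given the dual solution (the tolerance is a hypothetical numerical choice):

```python
import numpy as np

def split_support_vectors(alpha, C, tol=1e-6):
    # alpha_i ~ 0       -> not a support vector
    # 0 < alpha_i < C   -> unbound SV (on the confidence-one hyperplane, xi_i = 0)
    # alpha_i ~ C       -> bound SV (margin error, xi_i may be > 0)
    unbound = np.where((alpha > tol) & (alpha < C - tol))[0]
    bound = np.where(alpha >= C - tol)[0]
    return unbound, bound
```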