Support vector machines
Guillaume Obozinski
Ecole des Ponts - ParisTech, SOCN course 2014
Outline
1 Constrained optimization, Lagrangian duality and KKT
2 Support vector machines
Constrained optimization, Lagrangian duality and KKT
Review: Constrained optimization

Optimization problem in canonical form:
$$\min_{x \in X} f(x) \quad \text{s.t.} \quad h_i(x) = 0, \; i \in [[1, n]], \qquad g_j(x) \le 0, \; j \in [[1, m]],$$
with $X \subset \mathbb{R}^p$, $f$ and the $g_j$ functions, and the $h_i$ affine functions.

The problem is convex if $f$, the $g_j$ and $X$ are convex (w.l.o.g. $X = \mathbb{R}^p$).

Lagrangian:
$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^n \lambda_i h_i(x) + \sum_{j=1}^m \mu_j g_j(x)$$
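For instance, the nonnegative least-squares problem $\min_x \|Ax - b\|^2$ s.t. $x \ge 0$ fits this canonical form, with no equality constraints and one affine inequality per coordinate:
$$\min_{x \in \mathbb{R}^p} \; \|Ax - b\|^2 \quad \text{s.t.} \quad g_j(x) = -x_j \le 0, \; j \in [[1, p]].$$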
Lagrangian duality

Lagrangian:
$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^n \lambda_i h_i(x) + \sum_{j=1}^m \mu_j g_j(x)$$

Primal vs dual problem:
$$p^\star = \min_{x \in X} \; \max_{\lambda \in \mathbb{R}^n, \, \mu \in \mathbb{R}^m_+} L(x, \lambda, \mu) \qquad (P)$$
$$d^\star = \max_{\lambda \in \mathbb{R}^n, \, \mu \in \mathbb{R}^m_+} \; \min_{x \in X} L(x, \lambda, \mu) \qquad (D)$$
Max-min inequalities

For all $x$ and $y$, $f(x, y) \le \max_{y'} f(x, y')$. Taking the minimum over $x$ on both sides gives $\min_x f(x, y) \le \min_x \max_{y'} f(x, y')$, and since this holds for every $y$,
$$\max_y \min_x f(x, y) \le \min_x \max_y f(x, y).$$

Weak duality
In general, we have $d^\star \le p^\star$. This is called weak duality.

Strong duality
In some cases, we have strong duality: $d^\star = p^\star$, i.e. the optimal values of (P) and (D) coincide.
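A simple example shows that the max-min inequality can be strict: for $f(x, y) = (x - y)^2$ with $x, y \in [0, 1]$,
$$\max_y \min_x (x - y)^2 = 0 \quad \text{(take } x = y\text{)}, \qquad \min_x \max_y (x - y)^2 = \tfrac{1}{4} \quad \text{(at } x = \tfrac{1}{2}\text{)},$$
so the duality gap $p^\star - d^\star$ can be strictly positive.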
Slater's qualification condition

Slater's qualification condition is a condition on the constraints that guarantees that strong duality holds. Consider an optimization problem in canonical form.

Definition: Slater's condition (strong form)
There exists $x \in X$ such that $h(x) = 0$ and $g(x) < 0$ entrywise.

Definition: Slater's condition (weak form)
There exists $x \in X$ such that $h(x) = 0$ and $g(x) \le 0$ entrywise, but with $g_i(x) < 0$ if $g_i$ is not affine.

Slater's condition requires that there exist a feasible point which is strictly feasible for all non-affine constraints.
Karush-Kuhn-Tucker conditions

Theorem
For a convex problem defined by differentiable functions $f, h_i, g_j$ (and satisfying a qualification condition such as Slater's), $x$ is an optimal solution if and only if there exists $(\lambda, \mu)$ such that the KKT conditions are satisfied.

KKT conditions
$$\nabla f(x) + \sum_{i=1}^n \lambda_i \nabla h_i(x) + \sum_{j=1}^m \mu_j \nabla g_j(x) = 0 \quad \text{(Lagrangian stationarity)}$$
$$h(x) = 0, \quad g(x) \le 0 \quad \text{(primal feasibility)}$$
$$\forall j \in [[1, m]], \; \mu_j \ge 0 \quad \text{(dual feasibility)}$$
$$\forall j \in [[1, m]], \; \mu_j g_j(x) = 0 \quad \text{(complementary slackness)}$$
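A one-dimensional worked example: minimize $x^2$ subject to $1 - x \le 0$. The Lagrangian is $L(x, \mu) = x^2 + \mu(1 - x)$ and the KKT conditions read
$$2x - \mu = 0, \qquad 1 - x \le 0, \qquad \mu \ge 0, \qquad \mu(1 - x) = 0.$$
If $\mu = 0$ then $x = 0$, which is infeasible; hence $\mu > 0$, so complementary slackness forces $x = 1$ and stationarity gives $\mu = 2 \ge 0$. All conditions hold at $(x, \mu) = (1, 2)$, so $x^\star = 1$.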
Support vector machines
Hard-margin SVM

Binary classification problem with $y_i \in \{-1, 1\}$.

Margin: $\frac{1}{\|w\|}$

Constraints:
for $y_i = 1$, require $w^\top x_i + b \ge 1$;
for $y_i = -1$, require $w^\top x_i + b \le -1$.

This leads to
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad \forall i, \; y_i(w^\top x_i + b) \ge 1.$$

a quadratic program (not such a useful property nowadays)
infeasible if the data is not separable
Hard-margin SVM
[Figure slide: illustration of the hard-margin separating hyperplane and its margin.]
Soft-margin SVM

Allow some points to be on the wrong side of the margin.
Penalize them by a cost proportional to their distance to the margin.
Introduce slack variables $\xi_i$ measuring the violation for each datapoint.

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \forall i, \; y_i(w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0$$
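As a practical aside, this soft-margin problem is what standard libraries solve. A minimal sketch with scikit-learn (assumed installed; the synthetic data is purely illustrative):

import numpy as np
from sklearn.svm import SVC

# Two Gaussian clouds, labels in {-1, +1}; illustrative data only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # C is the slack penalty above
clf.fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
print("support vector indices:", clf.support_)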
Lagrangian of the SVM

$$L(w, b, \xi, \alpha, \nu) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - \xi_i - y_i(w^\top x_i + b)\big) - \sum_i \nu_i \xi_i$$
$$= \frac{1}{2}\|w\|^2 - w^\top\Big(\sum_i \alpha_i y_i x_i\Big) + \sum_i \xi_i\,(C - \alpha_i - \nu_i) - b\sum_i \alpha_i y_i + \sum_i \alpha_i$$

Stationarity of the Lagrangian:
$$\nabla_w L = w - \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \nu_i, \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i,$$

so that $\nabla L = 0$ leads to
$$w = \sum_i \alpha_i y_i x_i, \qquad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0.$$
Dual of the SVM

$$\max_\alpha \; -\frac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2 + \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \forall i, \; 0 \le \alpha_i \le C,$$

i.e., expanding the squared norm,

$$\max_\alpha \; -\frac{1}{2}\sum_{1 \le i, j \le n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j + \sum_i \alpha_i,$$

or in matrix form

$$\max_\alpha \; -\frac{1}{2}\alpha^\top D_y K D_y \alpha + \alpha^\top \mathbf{1} \quad \text{s.t.} \quad \alpha^\top y = 0, \quad 0 \le \alpha \le C,$$

with
$y = (y_1, \ldots, y_n)^\top$ the vector of labels,
$D_y = \mathrm{Diag}(y)$ the diagonal matrix with the labels on its diagonal,
$K$ the Gram matrix with $K_{ij} = x_i^\top x_j$.
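Since the dual is a QP in standard form, a generic QP solver applies directly. A minimal sketch with cvxopt (assumed installed), which minimizes $\frac{1}{2}\alpha^\top P \alpha + q^\top \alpha$ subject to $G\alpha \le h$, $A\alpha = b$, so the maximization above is negated:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    # Maximize -(1/2) a'DyKDy a + a'1  <=>  minimize (1/2) a'DyKDy a - a'1.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    K = X @ X.T                                           # Gram matrix K_ij = x_i'x_j
    P = matrix(np.outer(y, y) * K)                        # D_y K D_y
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # stacked box constraints
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))  # 0 <= alpha_i <= C
    A = matrix(y.reshape(1, -1))                          # alpha'y = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                             # optimal alpha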
KKT conditions for the SVM

$w = \sum_i \alpha_i y_i x_i$  (LS)
$\alpha_i + \nu_i = C$  (LS)
$\sum_i \alpha_i y_i = 0$  (LS)
$1 - \xi_i - y_i f(x_i) \le 0$  (PF)
$\xi_i \ge 0$  (PF)
$\alpha_i \ge 0$  (DF)
$\nu_i \ge 0$  (DF)
$\alpha_i\big(1 - \xi_i - y_i f(x_i)\big) = 0$  (CS)
$\nu_i \xi_i = 0$  (CS)

with $f(x_i) = w^\top x_i + b$.

Let
$I = \{i \mid \xi_i > 0\}$ (margin violators),
$M = \{i \mid y_i f(x_i) = 1\}$ (points on the margin),
$S = \{i \mid \alpha_i \ne 0\}$ (support vectors),
$W = (I \cup M)^c$ (strictly well-classified points).

Then
$i \in I \Rightarrow \nu_i = 0 \Rightarrow \alpha_i = C \Rightarrow i \in S$,
$i \in W \Rightarrow \alpha_i = 0 \Rightarrow i \notin S$,
and we have $0 \le \alpha_i \le C$.

The set $S$ of support vectors is therefore composed of some points on the margin and all incorrectly placed points.
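Once a dual solution $\alpha$ is available (e.g. from the QP sketch above), these conditions translate directly into code. A minimal numpy sketch (the tolerance is an illustrative choice):

import numpy as np

def recover_primal(X, y, alpha, C, tol=1e-6):
    S = alpha > tol                         # support vectors: alpha_i != 0
    w = (alpha * y) @ X                     # w = sum_i alpha_i y_i x_i   (LS)
    free = S & (alpha < C - tol)            # margin points with 0 < alpha_i < C
    b = np.mean(y[free] - X[free] @ w)      # y_i f(x_i) = 1 there (CS); averaged
    return w, b, np.flatnonzero(S)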
SVM summary so far

Optimization problem formulated as a strongly convex QP, whose dual is also a QP.
The support vectors are the points that have a nonzero optimal weight $\alpha_i$.
The optimal solution is $w = \sum_{i \in S} \alpha_i y_i x_i$, i.e. a weighted combination of the support vectors.
The solution does not depend on the well-classified points, which leads to working set strategies and a computational gain.

Remarks:
1 The dual solution $\alpha$ is not necessarily unique: there might be several possible sets of support vectors.
2 How do we determine $b$? For any margin point $i$ with $0 < \alpha_i < C$, complementary slackness gives $y_i(w^\top x_i + b) = 1$, hence $b = y_i - w^\top x_i$.
Representer property for the SVM

$$f(x) = w^\top x + b = \sum_{i \in S} \alpha_i y_i \, x_i^\top x + b = \sum_{i \in S} \alpha_i y_i \, k(x_i, x) + b, \qquad \text{with } k(x, x') = x^\top x'.$$

Eventually, this whole formulation depends only on the dot products between points.

Can we use another dot product than the one associated with the usual Euclidean distance in $\mathbb{R}^p$?
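This is the key observation behind kernel methods. A minimal numpy sketch of the representer form of $f$, with the Euclidean dot product swapped for a Gaussian kernel (an illustrative choice of kernel and bandwidth):

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def decision_function(X_test, X_sv, y_sv, alpha_sv, b, sigma=1.0):
    # f(x) = sum_{i in S} alpha_i y_i k(x_i, x) + b
    return (alpha_sv * y_sv) @ rbf_kernel(X_sv, X_test, sigma) + b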
Hinge loss interpretation of the SVM

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad \forall i, \; \xi_i \ge 1 - y_i(w^\top x_i + b), \; \xi_i \ge 0$$

Eliminating the slack variables:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C\sum_i \max\big(1 - y_i(w^\top x_i + b), \, 0\big)$$

Define the hinge loss $\ell(a, y) = (1 - ya)_+$ with $(u)_+ = \max(u, 0)$. Our problem is now of the form
$$\min_{w, b} \; \sum_i \ell(f(x_i), y_i) + \frac{1}{2C}\|w\|^2, \qquad \text{with } f(x) = w^\top x + b.$$
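This unconstrained form is convenient for first-order methods. A minimal subgradient-descent sketch on the objective above (the $1/\sqrt{t}$ step size is an illustrative choice):

import numpy as np

def hinge_svm(X, y, C=1.0, n_iters=1000):
    # Minimize sum_i (1 - y_i(w'x_i + b))_+ + ||w||^2 / (2C) by subgradient descent.
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for t in range(1, n_iters + 1):
        viol = y * (X @ w + b) < 1          # points in the linear part of the hinge
        g_w = -(y[viol] @ X[viol]) + w / C  # subgradient w.r.t. w
        g_b = -y[viol].sum()                # subgradient w.r.t. b
        step = 1.0 / np.sqrt(t)
        w, b = w - step * g_w, b - step * g_b
    return w, b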
Hinge loss vs other losses

The hinge loss is the smallest convex loss that upper-bounds the 0-1 loss and equals 0 for large scores.
SVM with the quadratic hinge loss

Quadratic hinge loss: $\ell(a, y) = (1 - ya)_+^2$.

Quadratic SVM:
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C\sum_i \max\big(1 - y_i(w^\top x_i + b), \, 0\big)^2$$
$$\Longleftrightarrow \quad \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i^2 \quad \text{s.t.} \quad \forall i, \; y_i(w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0$$

Penalizes misclassified points more strongly (see the numeric sketch after this list).
Less robust to outliers.
Tends to be less sparse.
Score in $[0, 1]$ for $n$ large, interpretable as a probability.
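A small numeric illustration of the first two points, comparing the two losses on a few margin values $y f(x)$:

import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
hinge = np.maximum(1 - margins, 0)      # (1 - ya)_+
quad_hinge = hinge**2                   # (1 - ya)_+^2
print(np.c_[margins, hinge, quad_hinge])
# At margin -2 the quadratic hinge costs 9 vs 3 for the hinge: gross violations
# (e.g. outliers) dominate the objective, hence the reduced robustness.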
Imbalanced classification

Learn a binary classifier from $(x_i, y_i)$ pairs with
$$P = \{i \mid y_i = 1\}, \quad N = \{i \mid y_i = -1\}, \quad n_+ = |P|, \; n_- = |N|, \quad \text{and } n_+ \ll n_-.$$

Problem: to minimize the number of mistakes, the classifier learnt might classify all points as negative.

Some ways to address the issue:
Subsample the negatives, and learn an ensemble of classifiers.
Introduce different costs for the positives and the negatives:
$$\min_{w \in \mathbb{R}^p, \, b, \, \xi} \; \frac{1}{2}\|w\|_2^2 + C_+\sum_{i \in P} \xi_i + C_-\sum_{i \in N} \xi_i \quad \text{s.t.} \quad \forall i, \; y_i(w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0$$

Naive choice: $C_+ = C/n_+$ and $C_- = C/n_-$. This is suboptimal in theory and in practice! Better to search for the optimal hyperparameter pair $(C_+, C_-)$, as sketched below.
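A minimal sketch of that search with scikit-learn (assumed available), whose class_weight option rescales $C$ per class, i.e. $C_\pm = C \cdot w_\pm$; the grid values here are illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {
    "C": [0.1, 1.0, 10.0],
    "class_weight": [{1: w, -1: 1.0} for w in (1.0, 5.0, 10.0, 50.0)],
}
search = GridSearchCV(LinearSVC(), param_grid, scoring="balanced_accuracy", cv=5)
# search.fit(X, y)      # X: (n, p) features, y in {-1, +1}, assumed given
# search.best_params_   # the selected (C, class weight) pair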