Lecture 2: Linear SVM in the Dual Stéphane Canu stephane.canu@litislab.eu São Paulo 2015 July 22, 2015
Road map 1 Linear SVM Optimization in 10 slides Equality constraints Inequality constraints Dual formulation of the linear SVM Solving the dual Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Linear SVM: the problem

Linear SVMs are the solution of the following problem (called the primal). Let {(x_i, y_i); i = 1:n} be a set of labelled data with x_i ∈ IR^d and y_i ∈ {−1, 1}. A support vector machine (SVM) is a linear classifier associated with the following decision function: D(x) = sign(w⊤x + b), where w ∈ IR^d and b ∈ IR are given through the solution of the following problem:

min_{w,b} 1/2 ||w||^2 = 1/2 w⊤w   with   y_i(w⊤x_i + b) ≥ 1, i = 1, n

This is a quadratic program (QP):

min_z 1/2 z⊤Az − d⊤z   with   Bz ≤ e

z = (w, b)⊤, d = (0, ..., 0)⊤, A = [I 0; 0 0], B = −[diag(y)X, y] and e = −(1, ..., 1)⊤
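As an illustration (not part of the original slides), this primal QP can be handed directly to a generic solver. A minimal MATLAB sketch, assuming the Optimization Toolbox's quadprog and a separable data matrix X (n x d) with labels y (n x 1) in {−1, 1}:

% minimal sketch, assuming separable data X (n x d) and labels y in {-1,1}
[n, d] = size(X);
A = blkdiag(eye(d), 0);        % quadratic term: 1/2 [w; b]' A [w; b] = 1/2 ||w||^2
f = zeros(d+1, 1);             % no linear term
B = -[diag(y)*X, y];           % y_i (w' x_i + b) >= 1  rewritten as  B z <= e
e = -ones(n, 1);
z = quadprog(A, f, B, e);      % z = (w, b)
w = z(1:d);  b = z(d+1);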
Road map 1 Linear SVM Optimization in 10 slides Equality constraints Inequality constraints Dual formulation of the linear SVM Solving the dual
A simple example (to begin with)

min_{x_1, x_2} J(x) = (x_1 − a)^2 + (x_2 − b)^2

[Figure: iso-cost lines J(x) = k and the gradient ∇_x J(x) at a point x]
A simple example (to begin with)

min_{x_1, x_2} J(x) = (x_1 − a)^2 + (x_2 − b)^2   with   H(x) = α(x_1 − c)^2 + β(x_2 − d)^2 + γ x_1 x_2 − 1 ≤ 0

[Figure: feasible domain Ω = {x | H(x) ≤ 0}, iso-cost lines J(x) = k, gradients ∇_x J(x) and ∇_x H(x), tangent hyperplane; at the constrained optimum ∇_x H(x) = λ ∇_x J(x)]
The only one equality constraint case

min_x J(x)   with   H(x) = 0

J(x + εd) ≈ J(x) + ε ∇_x J(x)⊤ d
H(x + εd) ≈ H(x) + ε ∇_x H(x)⊤ d

Loss J: d is a descent direction if there exists ε_0 ∈ IR such that for all ε ∈ IR, 0 < ε ≤ ε_0: J(x + εd) < J(x)  ⟺  ∇_x J(x)⊤ d < 0

Constraint H: d is a feasible direction if there exists ε_0 ∈ IR such that for all ε ∈ IR, 0 < ε ≤ ε_0: H(x + εd) = 0  ⟺  ∇_x H(x)⊤ d = 0

If at x* the vectors ∇_x J(x*) and ∇_x H(x*) are collinear, there is no feasible descent direction d. Therefore, x* is a local solution of the problem.
Lagrange multipliers

Assume J and the functions H_i are continuously differentiable (and independent).

P = min_{x ∈ IR^n} J(x)   with   H_1(x) = 0 (λ_1), H_2(x) = 0 (λ_2), ..., H_p(x) = 0 (λ_p)

Each constraint is associated with λ_i: the Lagrange multiplier.

Theorem (First order optimality conditions)
For x* to be a local minimum of P, it is necessary that:

∇_x J(x*) + Σ_{i=1}^p λ_i ∇_x H_i(x*) = 0   and   H_i(x*) = 0, i = 1, p
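A minimal worked example (not from the slides) illustrates the theorem: take J(x) = x_1^2 + x_2^2 with the single constraint H(x) = x_1 + x_2 − 1 = 0. Stationarity ∇_x J(x*) + λ ∇_x H(x*) = 0 gives 2x_1* + λ = 0 and 2x_2* + λ = 0, hence x_1* = x_2* = −λ/2; the constraint then yields λ = −1 and x* = (1/2, 1/2).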
Plan 1 Linear SVM Optimization in 10 slides Equality constraints Inequality constraints Dual formulation of the linear SVM Solving the dual
The only one inequality constraint case

min_x J(x)   with   G(x) ≤ 0

J(x + εd) ≈ J(x) + ε ∇_x J(x)⊤ d
G(x + εd) ≈ G(x) + ε ∇_x G(x)⊤ d

Cost J: d is a descent direction if there exists ε_0 ∈ IR such that for all ε ∈ IR, 0 < ε ≤ ε_0: J(x + εd) < J(x)  ⟺  ∇_x J(x)⊤ d < 0

Constraint G: d is a feasible direction if there exists ε_0 ∈ IR such that for all ε ∈ IR, 0 < ε ≤ ε_0: G(x + εd) ≤ 0  ⟺
- G(x) < 0: no limit here on d
- G(x) = 0: ∇_x G(x)⊤ d ≤ 0

Two possibilities: if x* lies at the limit of the feasible domain (G(x*) = 0) and if the vectors ∇_x J(x*) and ∇_x G(x*) are collinear and point in opposite directions, there is no feasible descent direction d at that point; therefore x* is a local solution of the problem... or ∇_x J(x*) = 0.
Two possibilities for optimality

∇_x J(x*) = −µ* ∇_x G(x*) and µ* > 0; G(x*) = 0
or
∇_x J(x*) = 0 and µ* = 0; G(x*) < 0

This alternative is summarized in the so-called complementarity condition:

µ* G(x*) = 0  ⟺  either µ* = 0 and G(x*) < 0, or G(x*) = 0 and µ* > 0
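A minimal illustration of the two cases (not from the slides): minimizing J(x) = x^2 under G(x) = 1 − x ≤ 0 gives the active case x* = 1, µ* = 2 > 0, G(x*) = 0 (indeed ∇J(x*) = 2 = −µ* ∇G(x*)); minimizing the same J under G(x) = −1 − x ≤ 0 gives the inactive case x* = 0, µ* = 0, G(x*) = −1 < 0.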
First order optimality condition (1)

problem P = min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1, ..., p   and   g_i(x) ≤ 0, i = 1, ..., q

Definition: Karush, Kuhn and Tucker (KKT) conditions

stationarity: ∇J(x*) + Σ_{j=1}^p λ_j ∇h_j(x*) + Σ_{i=1}^q µ_i ∇g_i(x*) = 0
primal admissibility: h_j(x*) = 0, j = 1, ..., p and g_i(x*) ≤ 0, i = 1, ..., q
dual admissibility: µ_i ≥ 0, i = 1, ..., q
complementarity: µ_i g_i(x*) = 0, i = 1, ..., q

λ_j and µ_i are called the Lagrange multipliers of problem P
First order optimality condition (2)

Theorem (12.1 Nocedal & Wright pp 321)
If a vector x* is a local solution of problem P, then there exist Lagrange multipliers such that (x*, {λ_j}_{j=1:p}, {µ_i}_{i=1:q}) fulfill the KKT conditions (under some conditions, e.g. the linear independence constraint qualification).

If the problem is convex, then a stationary point is the solution of the problem.

A quadratic program (QP) is convex when...

(QP)   min_z 1/2 z⊤Az − d⊤z   with   Bz ≤ e

... when matrix A is positive definite.
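A one-line numerical sanity check (not from the slides), in MATLAB: for a QP written as above with a symmetric quadratic term A, convexity can be read off the eigenvalues of A (non-negative eigenvalues give convexity; strictly positive ones, as stated on the slide, give strict convexity).

min(eig(A)) >= 0     % convex (A positive semi-definite)
min(eig(A)) > 0      % strictly convex (A positive definite)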
KKT condition - Lagrangian (3)

problem P = min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1, ..., p   and   g_i(x) ≤ 0, i = 1, ..., q

Definition: Lagrangian
The Lagrangian of problem P is the following function:

L(x, λ, µ) = J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x)

The importance of being a Lagrangian:
- the stationarity condition can be written: ∇_x L(x, λ, µ) = 0
- the Lagrangian saddle point: max_{λ,µ} min_x L(x, λ, µ)

Primal variables: x; dual variables: λ, µ (the Lagrange multipliers)
Duality definitions (1)

Primal and (Lagrange) dual problems

P = min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1, p   and   g_i(x) ≤ 0, i = 1, q

D = max_{λ ∈ IR^p, µ ∈ IR^q} Q(λ, µ)   with   µ_j ≥ 0, j = 1, q

Dual objective function:

Q(λ, µ) = inf_x L(x, λ, µ) = inf_x ( J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x) )

Wolfe dual problem:

W = max_{x, λ ∈ IR^p, µ ∈ IR^q} L(x, λ, µ)   with   µ_j ≥ 0, j = 1, q
    and   ∇_x J(x) + Σ_{j=1}^p λ_j ∇_x h_j(x) + Σ_{i=1}^q µ_i ∇_x g_i(x) = 0
Duality theorems (2)

Theorem (12.12, 12.13 and 12.14 Nocedal & Wright pp 346)
If f, g and h are convex and continuously differentiable (under some conditions, e.g. the linear independence constraint qualification), then the solution of the dual problem is the same as the solution of the primal:

(λ*, µ*) = solution of problem D
x* = arg min_x L(x, λ*, µ*)

Q(λ*, µ*) = min_x L(x, λ*, µ*) = L(x*, λ*, µ*) = J(x*) + λ*⊤ H(x*) + µ*⊤ G(x*) = J(x*)

and for any feasible point x:

Q(λ, µ) ≤ J(x)   ⟹   0 ≤ J(x) − Q(λ, µ)

The duality gap is the difference between the primal and dual cost functions.
Road map 1 Linear SVM Optimization in 10 slides Dual formulation of the linear SVM Solving the dual Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Linear SVM dual formulation - The Lagrangian

min_{w,b} 1/2 ||w||^2   with   y_i(w⊤x_i + b) ≥ 1, i = 1, n

Looking for the Lagrangian saddle point max_α min_{w,b} L(w, b, α) with the so-called Lagrange multipliers α_i ≥ 0:

L(w, b, α) = 1/2 ||w||^2 − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 )

α_i represents the influence of the constraint, and thus the influence of the training example (x_i, y_i).
Stationarity conditions

L(w, b, α) = 1/2 ||w||^2 − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 )

Computing the gradients:

∇_w L(w, b, α) = w − Σ_{i=1}^n α_i y_i x_i
∂L(w, b, α)/∂b = − Σ_{i=1}^n α_i y_i

we have the following optimality conditions:

∇_w L(w, b, α) = 0   ⟹   w = Σ_{i=1}^n α_i y_i x_i
∂L(w, b, α)/∂b = 0   ⟹   Σ_{i=1}^n α_i y_i = 0
KKT conditions for SVM

stationarity: w* − Σ_{i=1}^n α_i y_i x_i = 0   and   Σ_{i=1}^n α_i y_i = 0
primal admissibility: y_i(w*⊤x_i + b*) ≥ 1, i = 1, ..., n
dual admissibility: α_i ≥ 0, i = 1, ..., n
complementarity: α_i ( y_i(w*⊤x_i + b*) − 1 ) = 0, i = 1, ..., n

The complementarity condition splits the data into two sets:
- A, the set of active constraints: the useful points, A = {i ∈ [1, n] | y_i(w*⊤x_i + b*) = 1}
- its complement Ā: the useless points; if i ∉ A, then α_i = 0
The KKT conditions for SVM

The same KKT conditions, but using matrix notations and the active set A:

stationarity: w* − X⊤D_y α* = 0   and   α*⊤ y = 0
primal admissibility: D_y (Xw* + b* 𝟙) ≥ 𝟙
dual admissibility: α* ≥ 0
complementarity: D_y (X_A w* + b* 𝟙_A) = 𝟙_A   and   α*_Ā = 0

Knowing A, the solution verifies the following linear system:

w* − X_A⊤ D_y α*_A = 0
D_y X_A w* + b* y_A = e_A
y_A⊤ α*_A = 0

with D_y = diag(y_A), α_A = α(A), y_A = y(A) and X_A = X(A, :).
The KKT conditions as a linear system

w* − X_A⊤ D_y α*_A = 0
D_y X_A w* + b* y_A = e_A
y_A⊤ α*_A = 0

with D_y = diag(y_A), α_A = α(A), y_A = y(A) and X_A = X(A, :).

[ I         −X_A⊤ D_y   0   ] [ w   ]   [ 0   ]
[ D_y X_A    0          y_A ] [ α_A ] = [ e_A ]
[ 0          y_A⊤       0   ] [ b   ]   [ 0   ]

we can work on it to separate w from (α_A, b)
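As a small illustration (not from the slides), this system can be solved directly once the active set is known. A MATLAB sketch, assuming X (n x d), y (n x 1) and a vector of indices A:

% sketch: solve the KKT linear system for (w, alpha_A, b), assuming A is known
d  = size(X, 2);
XA = X(A, :);  yA = y(A);  Dy = diag(yA);  nA = numel(A);
K  = [ eye(d),      -XA'*Dy,    zeros(d, 1);
       Dy*XA,       zeros(nA),  yA;
       zeros(1, d), yA',        0 ];
rhs = [zeros(d, 1); ones(nA, 1); 0];
sol = K \ rhs;
w = sol(1:d);  alphaA = sol(d+1:d+nA);  b = sol(end);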
The SVM dual formulation

The SVM Wolfe dual:

max_{w,b,α} 1/2 ||w||^2 − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 )
with α_i ≥ 0, i = 1, ..., n
and w − Σ_{i=1}^n α_i y_i x_i = 0 and Σ_{i=1}^n α_i y_i = 0

using the fact: w = Σ_{i=1}^n α_i y_i x_i

The SVM Wolfe dual without w and b:

max_α − 1/2 Σ_{j=1}^n Σ_{i=1}^n α_j α_i y_i y_j x_j⊤x_i + Σ_{i=1}^n α_i
with α_i ≥ 0, i = 1, ..., n and Σ_{i=1}^n α_i y_i = 0
Linear SVM dual formulation

L(w, b, α) = 1/2 ||w||^2 − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 )

Optimality: w = Σ_{i=1}^n α_i y_i x_i   and   Σ_{i=1}^n α_i y_i = 0

L(α) = 1/2 Σ_{j=1}^n Σ_{i=1}^n α_j α_i y_i y_j x_j⊤x_i   (this is w⊤w)
       − Σ_{j=1}^n Σ_{i=1}^n α_j α_i y_i y_j x_j⊤x_i   (this is Σ_i α_i y_i w⊤x_i)
       − b Σ_{i=1}^n α_i y_i   (= 0)
       + Σ_{i=1}^n α_i
     = − 1/2 Σ_{j=1}^n Σ_{i=1}^n α_j α_i y_i y_j x_j⊤x_i + Σ_{i=1}^n α_i

The dual of the linear SVM is also a quadratic program:

problem D:   min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α   with   y⊤α = 0   and   0 ≤ α_i, i = 1, n

with G a symmetric n × n matrix such that G_ij = y_i y_j x_j⊤x_i
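As an illustration (not from the slides), problem D can again be handed to a generic QP solver; dedicated solvers (SMO and the like) are used in practice. A minimal MATLAB sketch, assuming quadprog, X (n x d) and y (n x 1) with entries in {−1, 1}:

% sketch: build G and solve the hard-margin dual with a generic QP solver
n = size(X, 1);
G = (y*y') .* (X*X');                                     % G_ij = y_i y_j x_j' x_i
e = ones(n, 1);
alpha = quadprog(G, -e, [], [], y', 0, zeros(n, 1), []);  % min 1/2 a'Ga - e'a, y'a = 0, a >= 0
w  = X' * (alpha .* y);                                   % stationarity: w = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);                                  % support vectors: active constraints
b  = mean(y(sv) - X(sv, :)*w);                            % complementarity: y_i (w'x_i + b) = 1 on A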
SVM primal vs. dual

Primal:
min_{w ∈ IR^d, b ∈ IR} 1/2 ||w||^2   with   y_i(w⊤x_i + b) ≥ 1, i = 1, n
- d + 1 unknowns
- n constraints
- classical QP
- perfect when d << n

Dual:
min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α   with   y⊤α = 0   and   0 ≤ α_i, i = 1, n
- n unknowns
- G Gram matrix (pairwise influence matrix)
- n box constraints
- easy to solve
- to be used when d > n

f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x⊤x_i) + b
The bi-dual (the dual of the dual)

min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α   with   y⊤α = 0   and   0 ≤ α_i, i = 1, n

The bidual:

L(α, λ, µ) = 1/2 α⊤Gα − e⊤α + λ y⊤α − µ⊤α
∇_α L(α, λ, µ) = Gα − e + λ y − µ

max_{α,λ,µ} − 1/2 α⊤Gα   with   Gα − e + λ y − µ = 0   and   0 ≤ µ

since ||w||^2 = α⊤Gα and D_y X w = Gα:

max_{w,λ} − 1/2 ||w||^2   with   D_y X w + λ y ≥ e

by identification (possibly up to a sign), b = λ is the Lagrange multiplier of the equality constraint
Cold case: the least square problem

Linear model: y_i = Σ_{j=1}^d w_j x_{ij} + ε_i, i = 1, n

n data and d variables; d < n

min_w Σ_{i=1}^n ( Σ_{j=1}^d x_{ij} w_j − y_i )^2 = ||Xw − y||^2

Solution: w = (X⊤X)^{-1} X⊤y

f(x) = x⊤ (X⊤X)^{-1} X⊤y   (= x⊤w)

What is the influence of each data point (the lines of matrix X)?

Shawe-Taylor & Cristianini's Book, 2004
Data point influence (contribution)

for any new data point x:

f(x) = x⊤ (X⊤X)(X⊤X)^{-1} (X⊤X)^{-1} X⊤y = x⊤ X⊤ [ X (X⊤X)^{-1} (X⊤X)^{-1} X⊤y ]

with w = (X⊤X)^{-1} X⊤y (d variables)   and   α = X (X⊤X)^{-1} (X⊤X)^{-1} X⊤y (n examples)

f(x) = Σ_{j=1}^d w_j x_j = Σ_{i=1}^n α_i (x⊤x_i)

from variables to examples:

α = X (X⊤X)^{-1} w   (n examples)   and   w = X⊤α   (d variables)

what if d ≫ n!
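A quick numerical check of this variables/examples correspondence (not from the slides), in MATLAB, assuming d < n and X⊤X invertible:

% sketch: the primal (variables) and dual (examples) least squares views coincide
w     = (X'*X) \ (X'*y);        % primal solution: d variables
alpha = X * ((X'*X) \ w);       % dual representation: n examples, alpha = X (X'X)^{-1} w
norm(X'*alpha - w)              % ~ 0 : w = X' alpha
x_new = randn(size(X, 2), 1);
f_var = x_new' * w;             % prediction from the variables
f_ex  = (X*x_new)' * alpha;     % same prediction from the example influences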
Road map 1 Linear SVM Optimization in 10 slides Dual formulation of the linear SVM Solving the dual Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Solving the dual (1)

Data point influence:
- α_i = 0: this point is useless
- α_i ≠ 0: this point is said to be support

f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^3 α_i y_i (x⊤x_i) + b

In this example, the decision border only depends on 3 points (d + 1).
Solving the dual (2)

Assume we know these 3 data points:

min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α   with   y⊤α = 0   and   0 ≤ α_i, i = 1, n
becomes
min_{α ∈ IR^3} 1/2 α⊤Gα − e⊤α   with   y⊤α = 0

L(α, b) = 1/2 α⊤Gα − e⊤α + b y⊤α

solve the following linear system:

Gα + b y = e
y⊤α = 0

U = chol(G);                  % G = U'U, U upper triangular
a = U \ (U' \ e);             % a = G^{-1} e
c = U \ (U' \ y);             % c = G^{-1} y
b = (y'*c) \ (y'*a);          % b = (y' G^{-1} e) / (y' G^{-1} y)
alpha = U \ (U' \ (e - b*y)); % alpha = G^{-1} (e - b y)

(here G, y and e are the Gram matrix, the labels and the ones vector restricted to the 3 support points)
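A quick sanity check of the result (not from the slides): the computed pair (alpha, b) should satisfy the two equations of the linear system.

norm(G*alpha + b*y - e)   % should be close to 0 (stationarity)
abs(y'*alpha)             % should be close to 0 (equality constraint)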
Conclusion: variables or data points?

- seeking a universal learning algorithm: no model for IP(x, y)
- the linear case: data is separable
- the non-separable case: double objective, minimizing the error together with the regularity of the solution: multi-objective optimisation
- duality: variables vs. examples; use the primal when d < n (in the linear case) or when matrix G is hard to compute; otherwise use the dual
- universality = non-linearity: kernels