SVM and Kernel Machines
Stéphane Canu, stephane.canu@litislab.eu
Escuela de Ciencias Informáticas 20, July 3, 20
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
Linear classification
Find a line to separate (classify) blue from red:
    D(x) = sign( v⊤x + a )
the decision border:  v⊤x + a = 0
There are many solutions... The problem is ill posed.
How to choose a solution?
This is not the problem we want to solve
{(x_i, y_i); i = 1, n} is a training sample, i.i.d. drawn according to IP(x, y), which is unknown.
We want to be able to classify new observations: minimize IP(error).
Looking for a universal approach:
    use the training data (allowing a few errors)
    prove that IP(error) remains small
    non linear
    scalable: algorithmic complexity
With large probability:
    IP(error) ≤ ÎP(error) + ϕ(‖v‖)
where ÎP(error) = 0 here (no training error) and ϕ(‖v‖) is related to the margin.
Vapnik's book, 1982
Statistical machine learning
Computational learning theory (COLT)
A training sample {(x_i, y_i), i = 1, n} is drawn from the unknown distribution IP(x, y).
A learning algorithm A builds the decision function f = v⊤x + a.
Given a loss L and predictions p = f(x):
    empirical error:  ÎP(error) = (1/n) Σ_{i=1}^n L(f(x_i), y_i)
    true error:       IP(error) = IE(L)
With probability at least 1 − δ:
    IP(error) ≤ ÎP(error) + ϕ(‖v‖)
Vapnik's book, 1982
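To make the empirical error concrete, here is a minimal MATLAB/Octave sketch (the sample X, y and the classifier (v, a) are arbitrary toy values, not from the slides); it computes ÎP(error) with the 0/1 loss.

n = 100; d = 2;
X = randn(n, d); y = sign(randn(n, 1));    % toy i.i.d. sample
v = randn(d, 1); a = 0;                    % some linear decision function f(x) = v'*x + a
emp_err = mean( sign(X*v + a) ~= y );      % hat{IP}(error) = (1/n) sum_i L(f(x_i), y_i), 0/1 loss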
Linear discrimination
Find a line to classify blue and red:
    D(x) = sign( v⊤x + a )
the decision border:  v⊤x + a = 0
There are many solutions... The problem is ill posed.
How to choose a solution? Choose the one with the largest margin.
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
Maximize our confidence = maximize the margin
the decision border:  Δ(v, a) = {x ∈ IR^d | v⊤x + a = 0}
Maximize the margin:
    max_{v,a}  min_{i ∈ [1,n]} dist(x_i, Δ(v, a))      (the margin m)
equivalently
    max_{v,a}  m   with   min_{i=1,n} |v⊤x_i + a| / ‖v‖ ≥ m
the problem is still ill posed: if (v, a) is a solution, (kv, ka) is also a solution for any 0 < k...
Margin and distance: details
The distance from x_i to the border Δ(v, a):
    dist(x_i, Δ(v, a)) = min_{s ∈ IR^d}  ½ ‖x_i − s‖²   with   v⊤s + a = 0
Lagrangian:  L(s, λ) = ½ ‖x_i − s‖² + λ (v⊤s + a)
    ∇_s L(s, λ) = −(x_i − s) + λ v
    ∇_s L(s, λ) = 0  ⟺  s = x_i − λ v
    v⊤s + a = 0 and s = x_i − λ v  ⟹  v⊤(x_i − λ v) + a = 0  ⟹  λ = (v⊤x_i + a) / (v⊤v)
    s = x_i − ((v⊤x_i + a)/(v⊤v)) v
    ‖x_i − s‖² = ((v⊤x_i + a)/(v⊤v))² v⊤v = (v⊤x_i + a)² / (v⊤v)
    ‖x_i − s‖ = |v⊤x_i + a| / ‖v‖
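As a quick numerical check of this closed form, a small MATLAB/Octave sketch (v, a and x_i are arbitrary values chosen for illustration):

d = 3;
v = randn(d, 1); a = randn; xi = randn(d, 1);
lambda = (v'*xi + a) / (v'*v);             % multiplier from the optimality conditions
s = xi - lambda*v;                         % projection of x_i onto the border (v'*s + a = 0)
dist_proj = norm(xi - s);                  % distance via the projection
dist_formula = abs(v'*xi + a) / norm(v);   % closed form |v'x_i + a| / ||v||
% dist_proj and dist_formula coincide up to rounding.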
Margin, norm and regularity
[Figure: value of the margin in the one-dimensional case; the border is {x | w⊤x = 0} and the margin is m = 1/‖w‖]
Maximize the margin:
    max_{v,a}  m   with   min_{i=1,n} |v⊤x_i + a| ≥ m,   ‖v‖² = 1
if the min is greater, everybody is greater (y_i ∈ {−1, 1}):
    max_{v,a}  m   with   y_i (v⊤x_i + a) ≥ m,  i = 1, n,   ‖v‖² = 1
change of variables: w = v/m and b = a/m, so that y_i (w⊤x_i + b) ≥ 1, i = 1, n, and m = 1/‖w‖
    ⟹   min_{w,b}  ‖w‖²   with   y_i (w⊤x_i + b) ≥ 1,  i = 1, n
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
Linear SVM: the problem
Linear SVMs are the solution of the following problem (called primal).
Let {(x_i, y_i); i = 1, n} be a set of labelled data with x_i ∈ IR^d, y_i ∈ {−1, 1}.
A support vector machine (SVM) is a linear classifier associated with the following decision function:
    D(x) = sign( w⊤x + b )
where w ∈ IR^d and b ∈ IR are given through the solution of the following problem:
    min_{w,b}  ½ ‖w‖² = ½ w⊤w   with   y_i (w⊤x_i + b) ≥ 1,  i = 1, n
This is a quadratic program (QP):
    min_z  ½ z⊤A z − d⊤z   with   B z ≤ e
    z = (w, b)⊤,  d = (0, ..., 0)⊤,  A = [ I 0 ; 0 0 ],  B = −[diag(y) X, y]  and  e = −(1, ..., 1)⊤
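A minimal sketch of this primal QP in MATLAB/Octave, assuming separable toy data and that quadprog (Optimization Toolbox, or the Octave optim package) is available; the variable names and data are illustrative, not from the slides.

n = 40; d = 2;
X = [randn(n/2, d) + 2; randn(n/2, d) - 2];   % two well separated toy classes
y = [ones(n/2, 1); -ones(n/2, 1)];
H = blkdiag(eye(d), 0);          % (1/2) z'Hz = (1/2)||w||^2 for z = [w; b]
f = zeros(d+1, 1);               % no linear term
B = -[diag(y)*X, y];             % y_i (w'x_i + b) >= 1  <=>  B z <= e
e = -ones(n, 1);
z = quadprog(H, f, B, e);        % z = [w; b]
w = z(1:d); b = z(d+1);
yhat = sign(X*w + b);            % decision function D(x) = sign(w'x + b)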
The linear discrimination problem
from Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002.
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
First order optimality condition (1)
problem P:
    min_{x ∈ IR^n}  J(x)
    with  h_j(x) = 0,  j = 1, ..., p
    and   g_i(x) ≤ 0,  i = 1, ..., q
Definition: Karush, Kuhn and Tucker (KKT) conditions
    stationarity:          ∇J(x*) + Σ_{j=1}^p λ_j ∇h_j(x*) + Σ_{i=1}^q µ_i ∇g_i(x*) = 0
    primal admissibility:  h_j(x*) = 0,  j = 1, ..., p
                           g_i(x*) ≤ 0,  i = 1, ..., q
    dual admissibility:    µ_i ≥ 0,  i = 1, ..., q
    complementarity:       µ_i g_i(x*) = 0,  i = 1, ..., q
λ_j and µ_i are called the Lagrange multipliers of problem P
First order optimality condition (2)
Theorem (2. Nocedal & Wright pp 32)
If a vector x* is a stationary point of problem P,
then there exist Lagrange multipliers such that (x*, {λ_j}_{j=1:p}, {µ_i}_{i=1:q}) fulfill the KKT conditions
(under some conditions, e.g. the linear independence constraint qualification).
If the problem is convex, then a stationary point is the solution of the problem.
A quadratic program (QP) is convex when...
    (QP)   min_z  ½ z⊤A z − d⊤z   with   B z ≤ e
...when matrix A is positive definite.
KKT condition - Lagrangian (3)
problem P:
    min_{x ∈ IR^n}  J(x)
    with  h_j(x) = 0,  j = 1, ..., p
    and   g_i(x) ≤ 0,  i = 1, ..., q
Definition: Lagrangian
The Lagrangian of problem P is the following function:
    L(x, λ, µ) = J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x)
The importance of being a Lagrangian:
    the stationarity condition can be written:  ∇_x L(x, λ, µ) = 0
    the Lagrangian saddle point:  max_{λ,µ} min_x L(x, λ, µ)
Primal variables: x; dual variables: λ, µ (the Lagrange multipliers)
Duality definitions (1)
Primal and (Lagrange) dual problems
    P:  min_{x ∈ IR^n}  J(x)
        with  h_j(x) = 0,  j = 1, p
        and   g_i(x) ≤ 0,  i = 1, q
    D:  max_{λ ∈ IR^p, µ ∈ IR^q}  Q(λ, µ)
        with  µ_j ≥ 0,  j = 1, q
Dual objective function:
    Q(λ, µ) = inf_x L(x, λ, µ) = inf_x [ J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x) ]
Wolfe dual problem
    W:  max_{x, λ ∈ IR^p, µ ∈ IR^q}  L(x, λ, µ)
        with  µ_j ≥ 0,  j = 1, q
        and   ∇J(x) + Σ_{j=1}^p λ_j ∇h_j(x) + Σ_{i=1}^q µ_i ∇g_i(x) = 0
Duality theorems (2)
Theorem (2.2, 2.3 and 2.4 Nocedal & Wright pp 346)
If f, g and h are convex and continuously differentiable,
then the solution of the dual problem is the same as the solution of the primal
(under some conditions, e.g. the linear independence constraint qualification).
    (λ*, µ*) = solution of problem D
    x* = arg min_x L(x, λ*, µ*)
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
Linear SVM dual formulation - the Lagrangian
    min_{w,b}  ½ ‖w‖²   with   y_i (w⊤x_i + b) ≥ 1,  i = 1, n
Looking for the Lagrangian saddle point  max_α min_{w,b} L(w, b, α)  with the so-called Lagrange multipliers α_i ≥ 0:
    L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 )
α_i represents the influence of constraint i, and thus the influence of the training example (x_i, y_i).
Optimality conditions
    L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 )
Computing the gradients:
    ∇_w L(w, b, α) = w − Σ_{i=1}^n α_i y_i x_i
    ∂L(w, b, α)/∂b = − Σ_{i=1}^n α_i y_i
we have the following optimality conditions:
    ∇_w L(w, b, α) = 0   ⟹   w = Σ_{i=1}^n α_i y_i x_i
    ∂L(w, b, α)/∂b = 0   ⟹   Σ_{i=1}^n α_i y_i = 0
Linear SVM dual formulation
    L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 )
Optimality:  w = Σ_{i=1}^n α_i y_i x_i   and   Σ_{i=1}^n α_i y_i = 0
    L(α) = ½ Σ_{i=1}^n Σ_{j=1}^n α_j α_i y_i y_j x_j⊤x_i − Σ_{i=1}^n Σ_{j=1}^n α_j α_i y_i y_j x_j⊤x_i − b Σ_{i=1}^n α_i y_i (= 0) + Σ_{i=1}^n α_i
         = − ½ Σ_{i=1}^n Σ_{j=1}^n α_j α_i y_i y_j x_j⊤x_i + Σ_{i=1}^n α_i
The dual of the linear SVM is also a quadratic program:
    problem D:   min_{α ∈ IR^n}  ½ α⊤G α − e⊤α   with   y⊤α = 0  and  0 ≤ α_i,  i = 1, n
with G the symmetric n × n matrix such that G_ij = y_i y_j x_j⊤x_i
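A matching sketch for the dual QP, again with quadprog and toy data (names are illustrative); the primal variables w and b are recovered from the optimality conditions and from any support vector.

n = 40; d = 2;
X = [randn(n/2, d) + 2; randn(n/2, d) - 2];
y = [ones(n/2, 1); -ones(n/2, 1)];
G = (y*y') .* (X*X') + 1e-10*eye(n);     % G_ij = y_i y_j x_j'x_i (tiny ridge for numerics)
e = ones(n, 1);
alpha = quadprog(G, -e, [], [], y', 0, zeros(n, 1), []);   % min 1/2 a'Ga - e'a, y'a = 0, a >= 0
sv = find(alpha > 1e-6);                 % support vectors: alpha_i > 0
w = X' * (alpha .* y);                   % w = sum_i alpha_i y_i x_i
b = mean( y(sv) - X(sv,:)*w );           % from y_i (w'x_i + b) = 1 on the support vectors
f = @(x) sign(x*w + b);                  % decision function, x given as a row vector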
SVM primal vs. dual
Primal
    min_{w ∈ IR^d, b ∈ IR}  ½ ‖w‖²
    with  y_i (w⊤x_i + b) ≥ 1,  i = 1, n
    d + 1 unknowns
    n constraints
    classical QP
    perfect when d << n
Dual
    min_{α ∈ IR^n}  ½ α⊤G α − e⊤α
    with  y⊤α = 0  and  0 ≤ α_i,  i = 1, n
    n unknowns
    G Gram matrix (pairwise influence matrix)
    n box constraints
    easy to solve
    to be used when d > n
Decision function:
    f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x⊤x_i) + b
Cold case: the least squares problem
Linear model:
    y_i = Σ_{j=1}^d w_j x_{ij} + ε_i,  i = 1, n
n data and d variables; d < n
    min_w  Σ_{i=1}^n ( Σ_{j=1}^d x_{ij} w_j − y_i )²  =  min_w  ‖Xw − y‖²
Solution:  w = (X⊤X)⁻¹ X⊤y
    f(x) = x⊤ (X⊤X)⁻¹ X⊤y = x⊤ w
What is the influence of each data point (the lines of matrix X)?
Shawe-Taylor & Cristianini's book, 2004
Data point influence (contribution)
for any new data point x:
    f(x) = x⊤ [ (X⊤X)(X⊤X)⁻¹ (X⊤X)⁻¹ X⊤y ]      where the bracket equals w
         = x⊤ X⊤ [ X (X⊤X)⁻¹ (X⊤X)⁻¹ X⊤y ]      where the bracket equals α
[Figure: the n × d data matrix X, with w attached to the d variables (columns) and α to the n examples (rows)]
    f(x) = Σ_{j=1}^d w_j x_j = Σ_{i=1}^n α_i (x⊤x_i)
from variables to examples:
    α = X (X⊤X)⁻¹ w    (n examples)    and    w = X⊤α    (d variables)
what if d ≥ n!
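A minimal sketch of this "from variables to examples" identity with toy data (names are illustrative); it requires d < n and X of full column rank so that X⊤X is invertible.

n = 20; d = 3;
X = randn(n, d); y = randn(n, 1);
w = (X'*X) \ (X'*y);             % least squares solution: d variables
alpha = X * ((X'*X) \ w);        % influence of each example: n coefficients
x = randn(d, 1);                 % a new data point
f_primal = x'*w;                 % f(x) = sum_j w_j x_j
f_dual = (X*x)' * alpha;         % f(x) = sum_i alpha_i (x'x_i)
% f_primal and f_dual coincide, and X'*alpha returns w.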
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
Solving the dual (1)
Data point influence:
    α_i = 0:  this point is useless
    α_i ≠ 0:  this point is said to be support
    f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x⊤x_i) + b
In the example only 3 points are support vectors:
    f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^3 α_i y_i (x⊤x_i) + b
The decision border only depends on 3 points (d + 1).
Solving the dual (2)
Assume we know these 3 data points:
    min_{α ∈ IR^n}  ½ α⊤G α − e⊤α   with   y⊤α = 0  and  0 ≤ α_i,  i = 1, n
        ⟹   min_{α ∈ IR^3}  ½ α⊤G α − e⊤α   with   y⊤α = 0
    L(α, b) = ½ α⊤G α − e⊤α + b y⊤α
solve the following linear system:
    G α + b y = e
    y⊤α = 0

U = chol(G);                      % upper triangular factor: G = U'*U
a = U \ (U' \ e);                 % a = G^{-1} e
c = U \ (U' \ y);                 % c = G^{-1} y
b = (y'*a) / (y'*c);              % from y'*alpha = 0
alpha = U \ (U' \ (e - b*y));     % alpha = G^{-1} (e - b*y)
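To go from this α and b back to the classifier, a short hypothetical continuation (Xsv and ysv are assumed to hold the 3 support points and their labels, with G built from them as above):

w = Xsv' * (alpha .* ysv);       % w = sum_i alpha_i y_i x_i over the support vectors
f = @(x) sign(x*w + b);          % decision function for a row vector x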
Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case
The non separable case: a bi-criteria optimization problem
Modeling potential errors: introducing slack variables ξ_i
for each (x_i, y_i):
    no error:  y_i (w⊤x_i + b) ≥ 1  ⟹  ξ_i = 0
    error:     ξ_i = 1 − y_i (w⊤x_i + b) > 0
    min_{w,b,ξ}  ½ ‖w‖²      and      min_{w,b,ξ}  (C/p) Σ_{i=1}^n ξ_i^p
    with  y_i (w⊤x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, n
Our hope: almost all ξ_i = 0
The non separable case
Modeling potential errors: introducing slack variables ξ_i
for each (x_i, y_i):
    no error:  y_i (w⊤x_i + b) ≥ 1  ⟹  ξ_i = 0
    error:     ξ_i = 1 − y_i (w⊤x_i + b) > 0
Minimizing also the slack (the error), for a given C > 0:
    min_{w,b,ξ}  ½ ‖w‖² + (C/p) Σ_{i=1}^n ξ_i^p
    with  y_i (w⊤x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, n
Looking for the saddle point of the Lagrangian, with the Lagrange multipliers α_i ≥ 0 and β_i ≥ 0:
    L(w, b, α, β) = ½ ‖w‖² + (C/p) Σ_{i=1}^n ξ_i^p − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i
Optimality conditions (p = 1)
    L(w, b, α, β) = ½ ‖w‖² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i
Computing the gradients:
    ∇_w L(w, b, α) = w − Σ_{i=1}^n α_i y_i x_i
    ∂L(w, b, α)/∂b = − Σ_{i=1}^n α_i y_i
    ∂L(w, b, α)/∂ξ_i = C − α_i − β_i
no change for w and b;  β_i ≥ 0 and C − α_i − β_i = 0  ⟹  α_i ≤ C
The dual formulation:
    min_{α ∈ IR^n}  ½ α⊤G α − e⊤α   with   y⊤α = 0  and  0 ≤ α_i ≤ C,  i = 1, n
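In code, the only change with respect to the separable dual is the upper bound C on α. A minimal sketch with overlapping toy classes, quadprog assumed available (names are illustrative):

n = 60; d = 2; C = 10;
X = [randn(n/2, d) + 1; randn(n/2, d) - 1];      % overlapping toy classes
y = [ones(n/2, 1); -ones(n/2, 1)];
G = (y*y') .* (X*X') + 1e-10*eye(n);
alpha = quadprog(G, -ones(n,1), [], [], y', 0, zeros(n,1), C*ones(n,1));  % 0 <= alpha_i <= C
sv = find(alpha > 1e-6 & alpha < C - 1e-6);      % unbounded support vectors
w = X' * (alpha .* y);
b = mean( y(sv) - X(sv,:)*w );                   % b from points with 0 < alpha_i < C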
SVM primal vs. dual
Primal
    min_{w,b,ξ ∈ IR^n}  ½ ‖w‖² + C Σ_{i=1}^n ξ_i
    with  y_i (w⊤x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, n
    d + n + 1 unknowns
    2n constraints
    classical QP
    to be used when n is too large to build G
Dual
    min_{α ∈ IR^n}  ½ α⊤G α − e⊤α
    with  y⊤α = 0  and  0 ≤ α_i ≤ C,  i = 1, n
    n unknowns
    G Gram matrix (pairwise influence matrix)
    2n box constraints
    easy to solve
    to be used when n is not too large
Optimality conditions (p = 2)
    L(w, b, α, β) = ½ ‖w‖² + (C/2) Σ_{i=1}^n ξ_i² − Σ_{i=1}^n α_i ( y_i (w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i
Computing the gradients:
    ∇_w L(w, b, α) = w − Σ_{i=1}^n α_i y_i x_i
    ∂L(w, b, α)/∂b = − Σ_{i=1}^n α_i y_i
    ∂L(w, b, α)/∂ξ_i = C ξ_i − α_i − β_i
no change for w and b;  C ξ_i − α_i − β_i = 0, and at the optimum
    (C/2) Σ_{i=1}^n ξ_i² − Σ_{i=1}^n α_i ξ_i − Σ_{i=1}^n β_i ξ_i = − (1/(2C)) Σ_{i=1}^n α_i²
The dual formulation:
    min_{α ∈ IR^n}  ½ α⊤(G + (1/C) I) α − e⊤α   with   y⊤α = 0  and  0 ≤ α_i,  i = 1, n
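Reusing the toy data X, y, n and C from the previous sketch, the p = 2 dual only changes the Gram matrix and drops the upper bound:

Greg = (y*y') .* (X*X') + eye(n)/C;              % G + (1/C) I
alpha2 = quadprog(Greg, -ones(n,1), [], [], y', 0, zeros(n,1), []);   % 0 <= alpha_i, no upper bound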
SVM primal vs. dual
Primal
    min_{w,b,ξ ∈ IR^n}  ½ ‖w‖² + (C/2) Σ_{i=1}^n ξ_i²
    with  y_i (w⊤x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, n
    d + n + 1 unknowns
    2n constraints
    classical QP
    to be used when n is too large to build G
Dual
    min_{α ∈ IR^n}  ½ α⊤(G + (1/C) I) α − e⊤α
    with  y⊤α = 0  and  0 ≤ α_i,  i = 1, n
    n unknowns
    the Gram matrix G is regularized
    n box constraints
    easy to solve
    to be used when n is not too large
Eliminating the slack but not the possible mistakes
    min_{w,b,ξ ∈ IR^n}  ½ ‖w‖² + (C/2) Σ_{i=1}^n ξ_i²
    with  y_i (w⊤x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, n
Introducing the hinge loss:  ξ_i = max( 1 − y_i (w⊤x_i + b), 0 )
    min_{w,b}  ½ ‖w‖² + (C/2) Σ_{i=1}^n max( 1 − y_i (w⊤x_i + b), 0 )²
Back to d + 1 variables, but this is no longer an explicit QP.
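A small sketch that just evaluates this hinge-loss criterion for some given w, b (reusing the toy X, y and C from the sketches above); it is not a solver.

xi = max(1 - y .* (X*w + b), 0);                 % hinge loss, one slack per example
J = 0.5*(w'*w) + (C/2) * sum(xi.^2);             % (1/2)||w||^2 + (C/2) sum_i xi_i^2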
SVM (minimizing the hinge loss) with variable selection
LP SVM:
    min_{w,b}  ‖w‖₁ + C Σ_{i=1}^n max( 1 − y_i (w⊤x_i + b), 0 )
Lasso SVM:
    min_{w,b}  ‖w‖₁ + (C/2) Σ_{i=1}^n max( 1 − y_i (w⊤x_i + b), 0 )²
Elastic net SVM (doubly regularized, with p = 1 or p = 2):
    min_{w,b}  ‖w‖₁ + ½ ‖w‖² + C Σ_{i=1}^n max( 1 − y_i (w⊤x_i + b), 0 )^p
Also: Dantzig selector for SVM, generalized garrote
Conclusion: variables or data points?
    seeking a universal learning algorithm: no model for IP(x, y)
    the linear case: data is separable
    the non separable case
    double objective: minimizing the error together with the regularity of the solution (multi objective optimisation)
    duality: variables ↔ examples
    use the primal when d < n (in the linear case) or when the matrix G is hard to compute; otherwise use the dual
    universality = non linearity → kernels