Lecture 2: Linear SVM in the Dual

Lecture 2: Linear SVM in the Dual. Stéphane Canu (INSA Rouen - LITIS), stephane.canu@litislab.eu. São Paulo 2015, July 22, 2015.

Road map
1 Linear SVM
  Optimization in 10 slides: equality constraints, inequality constraints
  Dual formulation of the linear SVM
  Solving the dual
(Figure from L. Bottou & C.-J. Lin, Support vector machine solvers, in Large Scale Kernel Machines, 2007.)

Linear SVM: the problem

The linear SVM is the solution of the following problem (called the primal). Let {(x_i, y_i); i = 1,...,n} be a set of labelled data with x_i ∈ IR^d and y_i ∈ {-1, 1}. A support vector machine (SVM) is a linear classifier associated with the following decision function: D(x) = sign(w^T x + b), where w ∈ IR^d and b ∈ IR are given through the solution of the following problem:

  min_{w,b} 1/2 ||w||^2 = 1/2 w^T w   with   y_i (w^T x_i + b) ≥ 1,  i = 1,...,n

This is a quadratic program (QP):

  min_z 1/2 z^T A z - d^T z   with   B z ≥ e

where z = (w, b), d = (0,...,0), A = [ I 0 ; 0 0 ], B = [ diag(y)X , y ] and e = (1,...,1).
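As an illustration only (not from the slides), here is a minimal MATLAB sketch of how this primal QP could be passed to quadprog (Optimization Toolbox); the toy data X, y are made up, and since quadprog expects constraints of the form A*z <= b, the margin constraints B z ≥ e are negated.

  % Made-up separable toy data: n points in IR^d with labels +/-1
  n = 6; d = 2;
  X = [2 2; 1 2; 2 1; -2 -2; -1 -2; -2 -1];
  y = [1; 1; 1; -1; -1; -1];
  % Primal QP: min_z 1/2 z'*A*z - d'*z  s.t.  B*z >= e, with z = (w, b)
  A = blkdiag(eye(d), 0);        % quadratic term (no penalty on b)
  f = zeros(d+1, 1);             % linear term d = 0
  B = [diag(y)*X, y];            % margin constraints
  e = ones(n, 1);
  z = quadprog(A, f, -B, -e);    % quadprog uses "<=", hence the sign flip
  w = z(1:d); b = z(d+1);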

Road map
1 Linear SVM
  Optimization in 10 slides: equality constraints, inequality constraints
  Dual formulation of the linear SVM
  Solving the dual

A simple example (to begin with)

  min_{x_1, x_2} J(x) = (x_1 - a)^2 + (x_2 - b)^2

(Figure: the iso-cost lines J(x) = k are circles around the unconstrained minimum, and the gradient ∇_x J(x) is orthogonal to them.)

A simple example (to begin with), now with a constraint

  min_{x_1, x_2} J(x) = (x_1 - a)^2 + (x_2 - b)^2   with   H(x) = α(x_1 - c)^2 + β(x_2 - d)^2 + γ x_1 x_2 - 1 ≤ 0

The feasible set is Ω = {x | H(x) ≤ 0}. (Figure: iso-cost lines J(x) = k, the feasible set Ω and the tangent hyperplane at the constrained solution, where ∇_x H(x) = λ ∇_x J(x).)

The only one equality constraint case

  min_x J(x)   with   H(x) = 0

First-order expansions: J(x + εd) ≈ J(x) + ε ∇_x J(x)^T d and H(x + εd) ≈ H(x) + ε ∇_x H(x)^T d.

Cost J: d is a descent direction if there exists ε_0 ∈ IR such that for all ε ∈ IR with 0 < ε ≤ ε_0, J(x + εd) < J(x), i.e. ∇_x J(x)^T d < 0.

Constraint H: d is a feasible direction if there exists ε_0 ∈ IR such that for all ε ∈ IR with 0 < ε ≤ ε_0, H(x + εd) = 0, i.e. ∇_x H(x)^T d = 0.

If at x* the vectors ∇_x J(x*) and ∇_x H(x*) are collinear, there is no feasible descent direction d. Therefore x* is a local solution of the problem.

Lagrange multipliers

Assume J and the functions H_i are continuously differentiable (and independent). Consider

  P:  min_{x ∈ IR^n} J(x)   with   H_1(x) = 0, H_2(x) = 0, ..., H_p(x) = 0

Each constraint H_i(x) = 0 is associated with a λ_i: the Lagrange multiplier.

Theorem (first-order optimality conditions). For x* to be a local minimum of P, it is necessary that

  ∇_x J(x*) + Σ_{i=1}^p λ_i ∇_x H_i(x*) = 0   and   H_i(x*) = 0,  i = 1,...,p
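To make the condition concrete, here is a small numerical sketch (not from the slides): for the toy objective J(x) = (x_1 - a)^2 + (x_2 - b)^2 with a single linear equality constraint chosen only for illustration, the constrained minimum satisfies ∇_x J(x*) + λ ∇_x H(x*) = 0.

  % Minimal sketch: J(x) = (x1-a)^2 + (x2-b)^2, made-up constraint H(x) = x1 + x2 - 1 = 0
  a = 2; b = 1;
  t = (a + b - 1)/2;                 % project (a,b) onto the line x1 + x2 = 1
  xstar = [a - t; b - t];            % constrained minimum in closed form
  gradJ = 2*(xstar - [a; b]);        % gradient of J at x*
  gradH = [1; 1];                    % gradient of H (constant)
  lambda = -gradJ(1)/gradH(1);       % multiplier read off the first component
  disp(gradJ + lambda*gradH)         % first-order condition: numerically [0; 0]
  disp(xstar(1) + xstar(2) - 1)      % constraint satisfied: 0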

Plan
1 Linear SVM
  Optimization in 10 slides: equality constraints, inequality constraints
  Dual formulation of the linear SVM
  Solving the dual

The only one inequality constraint case

  min_x J(x)   with   G(x) ≤ 0

First-order expansions: J(x + εd) ≈ J(x) + ε ∇_x J(x)^T d and G(x + εd) ≈ G(x) + ε ∇_x G(x)^T d.

Cost J: d is a descent direction if there exists ε_0 ∈ IR such that for all ε ∈ IR with 0 < ε ≤ ε_0, J(x + εd) < J(x), i.e. ∇_x J(x)^T d < 0.

Constraint G: d is a feasible direction if there exists ε_0 ∈ IR such that for all ε ∈ IR with 0 < ε ≤ ε_0, G(x + εd) ≤ 0. If G(x) < 0 there is no limit here on d; if G(x) = 0 we need ∇_x G(x)^T d ≤ 0.

Two possibilities: if x* lies at the limit of the feasible domain (G(x*) = 0) and the vectors ∇_x J(x*) and ∇_x G(x*) are collinear and in opposite directions, there is no feasible descent direction d at that point, so x* is a local solution of the problem... or ∇_x J(x*) = 0.

Two possibilities for optimality

  ∇_x J(x*) + μ* ∇_x G(x*) = 0 with μ* > 0 and G(x*) = 0,   or   ∇_x J(x*) = 0 with μ* = 0 and G(x*) < 0.

This alternative is summarized in the so-called complementarity condition:

  μ* G(x*) = 0   ⟺   either μ* = 0 and G(x*) < 0, or G(x*) = 0 and μ* > 0.

First order optimality condition (1)

  problem P:  min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1,...,p   and   g_i(x) ≤ 0, i = 1,...,q

Definition: Karush, Kuhn and Tucker (KKT) conditions

  stationarity:         ∇J(x*) + Σ_{j=1}^p λ_j ∇h_j(x*) + Σ_{i=1}^q μ_i ∇g_i(x*) = 0
  primal admissibility: h_j(x*) = 0, j = 1,...,p   and   g_i(x*) ≤ 0, i = 1,...,q
  dual admissibility:   μ_i ≥ 0, i = 1,...,q
  complementarity:      μ_i g_i(x*) = 0, i = 1,...,q

λ_j and μ_i are called the Lagrange multipliers of problem P.
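As a sanity check (not from the slides), the four KKT conditions can be verified numerically once a candidate (x*, λ*, μ*) is available; the tiny problem below, minimizing ||x||^2 under one made-up equality and one made-up inequality constraint, is only an illustration.

  % Tiny made-up problem: min x1^2 + x2^2  with  h(x) = x1 + x2 - 2 = 0  and  g(x) = -x1 <= 0
  xstar = [1; 1]; lambda = -2; mu = 0;                % candidate primal/dual solution
  gradJ = 2*xstar; gradh = [1; 1]; gradg = [-1; 0];
  stationarity    = gradJ + lambda*gradh + mu*gradg   % should be [0; 0]
  primal_eq       = xstar(1) + xstar(2) - 2           % should be 0
  primal_ineq     = -xstar(1)                         % should be <= 0
  dual_adm        = (mu >= 0)                         % should be true (1)
  complementarity = mu * (-xstar(1))                  % should be 0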

First order optimality condition (2)

Theorem (12.1, Nocedal & Wright p. 321). If x* is a local solution of problem P, then there exist Lagrange multipliers such that (x*, {λ_j}_{j=1:p}, {μ_i}_{i=1:q}) fulfills the KKT conditions (under some conditions, e.g. the linear independence constraint qualification).

If the problem is convex, then a stationary point is the solution of the problem.

A quadratic program (QP)

  min_z 1/2 z^T A z - d^T z   with   B z ≥ e

is convex when the matrix A is positive definite.

KKT condition - Lagrangian (3)

  problem P:  min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1,...,p   and   g_i(x) ≤ 0, i = 1,...,q

Definition: Lagrangian. The Lagrangian of problem P is the following function:

  L(x, λ, μ) = J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q μ_i g_i(x)

The importance of being a Lagrangian:
  the stationarity condition can be written ∇L(x*, λ, μ) = 0
  the Lagrangian saddle point: max_{λ,μ} min_x L(x, λ, μ)

Primal variables: x; dual variables: λ, μ (the Lagrange multipliers).

Duality definitions (1)

Primal and (Lagrange) dual problems:

  P:  min_{x ∈ IR^n} J(x)   with   h_j(x) = 0, j = 1,...,p   and   g_i(x) ≤ 0, i = 1,...,q
  D:  max_{λ ∈ IR^p, μ ∈ IR^q} Q(λ, μ)   with   μ_i ≥ 0, i = 1,...,q

Dual objective function:

  Q(λ, μ) = inf_x L(x, λ, μ) = inf_x [ J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q μ_i g_i(x) ]

Wolfe dual problem:

  W:  max_{x, λ ∈ IR^p, μ ∈ IR^q} L(x, λ, μ)
      with   μ_i ≥ 0, i = 1,...,q
      and    ∇_x J(x) + Σ_{j=1}^p λ_j ∇_x h_j(x) + Σ_{i=1}^q μ_i ∇_x g_i(x) = 0

Duality theorems (2)

Theorem (12.12, 12.13 and 12.14, Nocedal & Wright p. 346). If f, g and h are convex and continuously differentiable, then the solution of the dual problem is the same as the solution of the primal (under some conditions, e.g. the linear independence constraint qualification).

If (λ*, μ*) is the solution of problem D and x* = arg min_x L(x, λ*, μ*), then

  Q(λ*, μ*) = min_x L(x, λ*, μ*) = L(x*, λ*, μ*) = J(x*) + λ*^T H(x*) + μ*^T G(x*) = J(x*)

and for any feasible point x,

  Q(λ, μ) ≤ J(x),   i.e.   0 ≤ J(x) - Q(λ, μ).

The duality gap is the difference between the primal and dual cost functions.
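A minimal numerical sketch of the duality gap (not from the slides), using the made-up one-dimensional problem min_x 1/2 x^2 subject to x ≥ 1, i.e. g(x) = 1 - x ≤ 0, whose dual function works out to Q(μ) = -1/2 μ^2 + μ:

  % Made-up 1-D example: min 1/2 x^2  with  g(x) = 1 - x <= 0
  J = @(x) 0.5*x.^2;
  Q = @(mu) -0.5*mu.^2 + mu;     % dual function: Q(mu) = min_x L(x, mu), attained at x = mu
  x_feasible = 1.7; mu = 0.6;    % any feasible x and any mu >= 0
  gap = J(x_feasible) - Q(mu)    % always >= 0 (weak duality)
  J(1) - Q(1)                    % at the optimum x* = 1, mu* = 1, the gap is 0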

Road map
1 Linear SVM
  Optimization in 10 slides
  Dual formulation of the linear SVM
  Solving the dual
(Figure from L. Bottou & C.-J. Lin, Support vector machine solvers, in Large Scale Kernel Machines, 2007.)

Linear SVM dual formulation - the Lagrangian

  min_{w,b} 1/2 ||w||^2   with   y_i (w^T x_i + b) ≥ 1,  i = 1,...,n

Looking for the Lagrangian saddle point max_α min_{w,b} L(w, b, α) with the so-called Lagrange multipliers α_i ≥ 0:

  L(w, b, α) = 1/2 ||w||^2 - Σ_{i=1}^n α_i ( y_i (w^T x_i + b) - 1 )

α_i represents the influence of the constraint, and thus the influence of the training example (x_i, y_i).

Stationarity conditions

  L(w, b, α) = 1/2 ||w||^2 - Σ_{i=1}^n α_i ( y_i (w^T x_i + b) - 1 )

Computing the gradients:

  ∇_w L(w, b, α) = w - Σ_{i=1}^n α_i y_i x_i
  ∂L(w, b, α)/∂b = - Σ_{i=1}^n α_i y_i

we have the following optimality conditions:

  ∇_w L(w, b, α) = 0   ⟹   w = Σ_{i=1}^n α_i y_i x_i
  ∂L(w, b, α)/∂b = 0   ⟹   Σ_{i=1}^n α_i y_i = 0
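A small sketch (not from the slides) of how these two conditions are used in practice once some dual coefficients α are available; the data and coefficients below are placeholders, purely to show the vectorized formulas.

  % Placeholders: X is n-by-d, y is n-by-1 with entries +/-1, alpha is n-by-1
  X = [2 1; 1 2; -1 -1; -2 -1]; y = [1; 1; -1; -1];
  alpha = [0.25; 0; 0.25; 0];          % made-up dual coefficients
  w = X' * (alpha .* y)                % stationarity in w: w = sum_i alpha_i y_i x_i
  balance = sum(alpha .* y)            % stationarity in b: must be 0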

KKT conditions for SVM

  stationarity:         w* - Σ_{i=1}^n α_i y_i x_i = 0   and   Σ_{i=1}^n α_i y_i = 0
  primal admissibility: y_i (w*^T x_i + b*) ≥ 1,  i = 1,...,n
  dual admissibility:   α_i ≥ 0,  i = 1,...,n
  complementarity:      α_i ( y_i (w*^T x_i + b*) - 1 ) = 0,  i = 1,...,n

The complementarity condition splits the data into two sets. Let A be the set of active constraints, the useful points:

  A = { i ∈ [1, n] | y_i (w*^T x_i + b*) = 1 }

Its complement Ā contains the useless points: if i ∉ A, then α_i = 0.

The KKT conditions for SVM

The same KKT conditions, using matrix notation and the active set A:

  stationarity:         w* - X^T D_y α* = 0   and   α*^T y = 0
  primal admissibility: D_y (X w* + b* 1) ≥ 1
  dual admissibility:   α* ≥ 0
  complementarity:      D_y (X_A w* + b* 1_A) = 1_A   and   α*_Ā = 0

Knowing A, the solution verifies the following linear system:

  w* - X_A^T D_y α*_A = 0
  D_y X_A w* + b* y_A = e_A
  y_A^T α*_A = 0

with D_y = diag(y_A), α_A = α(A), y_A = y(A) and X_A = X(A, :).

The KKT conditions as a linear system

  w* - X_A^T D_y α*_A = 0
  D_y X_A w* + b* y_A = e_A
  y_A^T α*_A = 0

with D_y = diag(y_A), α_A = α(A), y_A = y(A) and X_A = X(A, :). In block-matrix form:

  [ I          -X_A^T D_y   0   ] [ w*   ]   [ 0   ]
  [ D_y X_A     0           y_A ] [ α*_A ] = [ e_A ]
  [ 0           y_A^T       0   ] [ b*   ]   [ 0   ]

We can work on this system to separate w* from (α*_A, b*).
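A minimal MATLAB sketch (not from the slides) of this block system, assuming the active set A is known; the toy data and the chosen A below are made up and only meant to show the bookkeeping.

  % Made-up toy data and a hypothetical active set A (indices of the support vectors)
  X = [2 2; 0 1; -2 -2; 0 -1]; y = [1; 1; -1; -1];
  d = size(X, 2); A = [2; 4]; nA = numel(A);
  XA = X(A,:); yA = y(A); Dy = diag(yA); eA = ones(nA, 1);
  % Block system [I, -XA'*Dy, 0; Dy*XA, 0, yA; 0, yA', 0] * [w; alphaA; b] = [0; eA; 0]
  M = [ eye(d),     -XA'*Dy,       zeros(d,1);
        Dy*XA,       zeros(nA,nA), yA;
        zeros(1,d),  yA',          0 ];
  rhs = [zeros(d,1); eA; 0];
  sol = M \ rhs;
  w = sol(1:d); alphaA = sol(d+1:d+nA); b = sol(end);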

The SVM dual formulation

The SVM Wolfe dual:

  max_{w,b,α} 1/2 ||w||^2 - Σ_{i=1}^n α_i ( y_i (w^T x_i + b) - 1 )
  with   α_i ≥ 0, i = 1,...,n,   w - Σ_{i=1}^n α_i y_i x_i = 0   and   Σ_{i=1}^n α_i y_i = 0

Using the fact that w = Σ_{i=1}^n α_i y_i x_i, the SVM Wolfe dual without w and b is:

  max_α  - 1/2 Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j^T x_i + Σ_{i=1}^n α_i
  with   α_i ≥ 0, i = 1,...,n   and   Σ_{i=1}^n α_i y_i = 0

Linear SVM dual formulation

  L(w, b, α) = 1/2 ||w||^2 - Σ_{i=1}^n α_i ( y_i (w^T x_i + b) - 1 )

Optimality: w = Σ_{i=1}^n α_i y_i x_i and Σ_{i=1}^n α_i y_i = 0. Substituting:

  L(α) = 1/2 Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j^T x_i          [this is 1/2 w^T w]
         - Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j^T x_i            [this is w^T w]
         - b Σ_{i=1}^n α_i y_i                                      [this is 0]
         + Σ_{i=1}^n α_i
       = - 1/2 Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j^T x_i + Σ_{i=1}^n α_i

The dual of the linear SVM is also a quadratic program:

  problem D:  min_{α ∈ IR^n} 1/2 α^T G α - e^T α   with   y^T α = 0   and   0 ≤ α_i,  i = 1,...,n

with G the symmetric n × n matrix such that G_ij = y_i y_j x_j^T x_i.
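For concreteness, a hedged MATLAB sketch (not from the slides) of how problem D could be set up and handed to quadprog (Optimization Toolbox); the data are placeholders, and a tiny ridge is added to G only to keep the toy problem numerically comfortable, since G is positive semidefinite here.

  % Placeholder data: n points in IR^d with labels +/-1
  X = [2 2; 0 1; -2 -2; 0 -1]; y = [1; 1; -1; -1]; n = size(X, 1);
  G = (y*y') .* (X*X');                    % G_ij = y_i y_j x_i'*x_j
  e = ones(n, 1);
  % Dual QP: min_alpha 1/2 alpha'*G*alpha - e'*alpha  with  y'*alpha = 0, alpha >= 0
  alpha = quadprog(G + 1e-10*eye(n), -e, [], [], y', 0, zeros(n,1), []);
  w = X' * (alpha .* y);                   % recover w from stationarity
  sv = find(alpha > 1e-6);                 % support vectors: alpha_i > 0
  b = mean(y(sv) - X(sv,:)*w);             % b from any active constraint y_i (w'x_i + b) = 1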

SVM primal vs. dual

Primal:

  min_{w ∈ IR^d, b ∈ IR} 1/2 ||w||^2   with   y_i (w^T x_i + b) ≥ 1,  i = 1,...,n

  d + 1 unknowns, n constraints; a classical QP, perfect when d << n.

Dual:

  min_{α ∈ IR^n} 1/2 α^T G α - e^T α   with   y^T α = 0   and   0 ≤ α_i,  i = 1,...,n

  n unknowns, G the Gram matrix (pairwise influence matrix), n box constraints; easy to solve, to be used when d > n.

In both cases the decision function is

  f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x^T x_i) + b
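A small sketch (not from the slides) checking numerically that the two expressions of the decision function agree; the support vectors, coefficients and test point below are made up.

  % Made-up primal/dual solutions for a tiny separable data set
  X = [1 1; 0 -1]; y = [1; -1]; alpha = [0.4; 0.4]; b = -0.2;
  w = X' * (alpha .* y);                         % w from the dual coefficients
  xnew = [3; 0.5];                               % a new point to classify
  f_primal = w' * xnew + b
  f_dual   = sum(alpha .* y .* (X * xnew)) + b   % same value, written with inner products
  label = sign(f_primal)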

The bi-dual (the dual of the dual)

  min_{α ∈ IR^n} 1/2 α^T G α - e^T α   with   y^T α = 0   and   0 ≤ α_i,  i = 1,...,n

The bidual:

  L(α, λ, μ) = 1/2 α^T G α - e^T α + λ y^T α - μ^T α
  ∇_α L(α, λ, μ) = G α - e + λ y - μ

  max_{α,λ,μ}  - 1/2 α^T G α   with   G α - e + λ y - μ = 0   and   0 ≤ μ

Since ||w||^2 = α^T G α and D_y X w = G α, this is

  max_{w,λ}  - 1/2 ||w||^2   with   D_y X w + λ y ≥ e

By identification (possibly up to a sign), b = λ is the Lagrange multiplier of the equality constraint.

Cold case: the least squares problem

Linear model:

  y_i = Σ_{j=1}^d w_j x_ij + ε_i,   i = 1,...,n

with n data points and d variables, d < n:

  min_w Σ_{i=1}^n ( Σ_{j=1}^d x_ij w_j - y_i )^2 = || X w - y ||^2

Solution: w = (X^T X)^{-1} X^T y, so that f(x) = x^T (X^T X)^{-1} X^T y = x^T w.

What is the influence of each data point (each row of matrix X)? (Shawe-Taylor & Cristianini's book, 2004)

Data point influence (contribution)

For any new data point x:

  f(x) = x^T (X^T X)(X^T X)^{-1} (X^T X)^{-1} X^T y
       = (x^T X^T) [ X (X^T X)^{-1} (X^T X)^{-1} X^T y ]

The d-variable view (w, one weight per variable) and the n-example view (α, one coefficient per example) are linked by

  α = X (X^T X)^{-1} w   (n examples)   and   w = X^T α   (d variables)

so that

  f(x) = Σ_{j=1}^d w_j x_j = Σ_{i=1}^n α_i (x^T x_i)

from variables to examples. What if d >> n?
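A short numerical sketch (not from the slides) of this change of viewpoint on made-up regression data: the prediction computed from the d weights w and from the n example coefficients α coincide.

  % Made-up regression data: n examples, d variables, d < n
  n = 8; d = 3;
  X = randn(n, d); t = randn(n, 1);        % t plays the role of the targets y_i
  w = (X'*X) \ (X'*t);                     % least squares: one weight per variable
  alpha = X * ((X'*X) \ w);                % same model: one coefficient per example
  xnew = randn(d, 1);
  f_variables = xnew' * w                  % f(x) = sum_j w_j x_j
  f_examples  = sum(alpha .* (X * xnew))   % f(x) = sum_i alpha_i (x' x_i), same value
  disp(norm(w - X'*alpha))                 % w = X'*alpha, numerically ~0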

SVM primal vs. dual (recall): the primal has d + 1 unknowns and n constraints (a classical QP, perfect when d << n); the dual has n unknowns, the Gram matrix G and n box constraints (easy to solve, to be used when d > n); in both cases f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x^T x_i) + b.

Road map
1 Linear SVM
  Optimization in 10 slides
  Dual formulation of the linear SVM
  Solving the dual
(Figure from L. Bottou & C.-J. Lin, Support vector machine solvers, in Large Scale Kernel Machines, 2007.)

Solving the dual (1)

Data point influence:
  α_i = 0: this point is useless
  α_i ≠ 0: this point is said to be a support vector

  f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^3 α_i y_i (x^T x_i) + b

In the illustrated example the decision border only depends on 3 points (d + 1).

Solving the dual (2)

Assume we know these 3 data points. The dual problem

  min_{α ∈ IR^n} 1/2 α^T G α - e^T α   with   y^T α = 0   and   0 ≤ α_i,  i = 1,...,n

reduces to

  min_{α ∈ IR^3} 1/2 α^T G α - e^T α   with   y^T α = 0

With L(α, b) = 1/2 α^T G α - e^T α + b y^T α, we solve the following linear system:

  G α + b y = e
  y^T α = 0

U = chol(G);              % G = U'*U, U upper triangular
a = U \ (U' \ e);         % a = G^{-1} e
c = U \ (U' \ y);         % c = G^{-1} y
b = (y'*a) / (y'*c);      % from y'*alpha = 0 with alpha = a - b*c
alpha = U \ (U' \ (e - b*y));
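As a usage sketch (not from the slides), here is how G, e and y could be formed from the (assumed known) support points before running the factorization above, and how w is recovered afterwards; the three points are made up so that G is positive definite.

  % Made-up example: three assumed support points in IR^3 with their labels
  Xs = [1 0 1; 0 1 1; 0 0 -1]; ys = [1; 1; -1];
  G = (ys*ys') .* (Xs*Xs');        % G_ij = y_i y_j x_i'*x_j, positive definite here
  e = ones(3, 1); y = ys;
  U = chol(G);                     % as above: G = U'*U
  a = U \ (U' \ e);
  c = U \ (U' \ y);
  b = (y'*a) / (y'*c);
  alpha = U \ (U' \ (e - b*y));
  w = Xs' * (alpha .* ys);         % primal w recovered from stationarity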

Conclusion: variables or data points?

  Seeking a universal learning algorithm: no model for IP(x, y).
  The linear case: data is separable.
  The non-separable case: a double objective, minimizing the error together with the regularity of the solution (multi-objective optimisation).
  Duality: variables ↔ examples.
  Use the primal when d < n (in the linear case) or when the matrix G is hard to compute; otherwise use the dual.
  Universality = nonlinearity: kernels.