SVM and Kernel machine

SVM and Kernel machine. Stéphane Canu, stephane.canu@litislab.eu. Escuela de Ciencias Informáticas 2011, July 3, 2011.

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

Linear classification. Find a line to separate (classify) blue from red: D(x) = sign(v⊤x + a), with decision border v⊤x + a = 0. How to choose a solution? There are many separating lines: the problem is ill posed.

This is not the problem we want to solve. {(x_i, y_i); i = 1:n} is a training sample, i.i.d. drawn according to an unknown IP(x, y); we want to be able to classify new observations, i.e. minimize IP(error). Looking for a universal approach: use the training data (allowing a few errors), prove that IP(error) remains small, non linear, scalable in algorithmic complexity. With large probability, IP(error) ≤ ÎP(error) + ϕ(‖v‖), where ÎP(error) is the empirical error (= 0 here, separable case) and ϕ(‖v‖) is a margin-dependent term. Vapnik's book, 1982.

Statistical machine learning, computational learning theory (COLT). Setting: x is drawn from an unknown IP; from the sample {(x_i, y_i)}, i = 1,n, a learning algorithm A returns f(x) = v⊤x + a; a prediction p = f(x) is assessed by a loss L(f(x), y). Empirical error: ÎP(error) = 1/n Σ_{i=1}^n L(f(x_i), y_i); expected error: IP(error) = IE(L). With probability at least 1 − δ: IP(error) ≤ ÎP(error) + ϕ(‖v‖). Vapnik's book, 1982.

Linear discrimination. Find a line to classify blue and red: D(x) = sign(v⊤x + a), with decision border v⊤x + a = 0. There are many solutions; the problem is ill posed. How to choose a solution? Choose the one with the largest margin.

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

Maximize our confidence = maximize the margin. The decision border: Δ(v, a) = {x ∈ IR^d : v⊤x + a = 0}. Maximize the margin: max_{v,a} min_{i∈[1,n]} dist(x_i, Δ(v, a)), i.e. max_{v,a} min_{i∈[1,n]} |v⊤x_i + a| / ‖v‖ (this minimum is the margin m). The problem is still ill posed: if (v, a) is a solution, then (kv, ka) is also a solution for any 0 < k...

Margin and distance: details. dist(x_i, Δ(v, a)) = min_{s ∈ IR^d} ‖x_i − s‖ with v⊤s + a = 0. Using the squared objective 1/2 ‖x_i − s‖², the Lagrangian is L(s, λ) = 1/2 ‖x_i − s‖² + λ(v⊤s + a); ∇_s L(s, λ) = −(x_i − s) + λv, so ∇_s L(s, λ) = 0 gives s = x_i − λv. Plugging into v⊤s + a = 0: v⊤(x_i − λv) + a = 0, hence λ = (v⊤x_i + a)/(v⊤v) and ‖x_i − s‖ = |v⊤x_i + a| / ‖v‖. Therefore dist(x_i, Δ(v, a)) = |v⊤x_i + a| / ‖v‖.
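A minimal numerical check of this distance formula (an illustrative sketch, not from the slides; the data X, v and a below are made up):

    % distance of each row of X to the hyperplane {x : v'*x + a = 0}
    X = [1 2; 3 1; -1 0.5];        % toy points, one per row (illustrative)
    v = [1; -1];  a = 0.5;         % hyperplane parameters (illustrative)
    d = abs(X*v + a) / norm(v);    % |v'*x_i + a| / ||v|| for every point
    margin = min(d)                % margin of this hyperplane on X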

Margin, norm and regularity. (Figure: value of the margin in the one-dimensional case; the margin hyperplanes lie at distance 1/‖w‖ on each side of {x : w⊤x = 0}.) Maximize the margin: max_{v,a} m with min_{i=1,n} y_i(v⊤x_i + a) = m and ‖v‖² = 1. If the minimum is greater than m, then every point is (y_i ∈ {−1, 1}), so this is max_{v,a} m with y_i(v⊤x_i + a) ≥ m, i = 1,n and ‖v‖² = 1. Change of variable: w = v/m and b = a/m, so that m = 1/‖w‖ and the problem becomes min_{w,b} ‖w‖² with y_i(w⊤x_i + b) ≥ 1, i = 1,n.

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

Linear SVM: the problem. Linear SVMs are the solution of the following problem (called the primal). Let {(x_i, y_i); i = 1:n} be a set of labelled data with x_i ∈ IR^d, y_i ∈ {−1, 1}. A support vector machine (SVM) is a linear classifier associated with the decision function D(x) = sign(w⊤x + b), where w ∈ IR^d and b ∈ IR are given through the solution of the following problem: min_{w,b} 1/2 ‖w‖² = 1/2 w⊤w with y_i(w⊤x_i + b) ≥ 1, i = 1,n. This is a quadratic program (QP): min_z 1/2 z⊤Az − d⊤z with Bz ≤ e, where z = (w, b)⊤, d = (0,...,0)⊤, A = [I 0; 0 0], B = −[diag(y)X, y] and e = −(1,...,1)⊤ (so that Bz ≤ e encodes y_i(w⊤x_i + b) ≥ 1).
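As an illustration (a sketch, not part of the slides), this primal QP can be fed directly to a generic QP solver; here is a minimal MATLAB/Octave version assuming quadprog from the Optimization Toolbox and linearly separable toy data X (n×d) and y (n×1, entries ±1):

    [n, d] = size(X);
    H = blkdiag(eye(d), 0);        % the matrix A above: penalize w only, not b
    f = zeros(d+1, 1);             % no linear term
    B = -[diag(y)*X, y];           % -diag(y)*[X, 1], so that B*z <= e
    e = -ones(n, 1);               % encodes y_i*(w'*x_i + b) >= 1
    z = quadprog(H, f, B, e);      % z = [w; b]
    w = z(1:d);  b = z(d+1);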

The linear discrimination problem (from Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002).

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

First order optimality condition (1). Problem P: min_{x ∈ IR^n} J(x) with h_j(x) = 0, j = 1,...,p and g_i(x) ≤ 0, i = 1,...,q. Definition: Karush, Kuhn and Tucker (KKT) conditions:
stationarity: ∇J(x*) + Σ_{j=1}^p λ_j ∇h_j(x*) + Σ_{i=1}^q µ_i ∇g_i(x*) = 0
primal admissibility: h_j(x*) = 0, j = 1,...,p and g_i(x*) ≤ 0, i = 1,...,q
dual admissibility: µ_i ≥ 0, i = 1,...,q
complementarity: µ_i g_i(x*) = 0, i = 1,...,q
λ_j and µ_i are called the Lagrange multipliers of problem P.

First order optimality condition (2). Theorem (12.1, Nocedal & Wright, p. 321): if x* is a local solution of problem P, then there exist Lagrange multipliers such that (x*, {λ_j}_{j=1:p}, {µ_i}_{i=1:q}) fulfills the KKT conditions (under some conditions, e.g. the linear independence constraint qualification). If the problem is convex, then a stationary point is the solution of the problem. A quadratic program (QP), min_z 1/2 z⊤Az − d⊤z with Bz ≤ e, is convex when matrix A is positive definite.

KKT condition - Lagrangian (3). Problem P: min_{x ∈ IR^n} J(x) with h_j(x) = 0, j = 1,...,p and g_i(x) ≤ 0, i = 1,...,q. Definition: the Lagrangian of problem P is the function L(x, λ, µ) = J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x). The importance of being a Lagrangian: the stationarity condition can be written ∇L(x*, λ, µ) = 0, and the solution is a saddle point of the Lagrangian, min_x max_{λ,µ} L(x, λ, µ). Primal variables: x; dual variables: λ, µ (the Lagrange multipliers).

Duality definitions (1). Primal and (Lagrange) dual problems:
P: min_{x ∈ IR^n} J(x) with h_j(x) = 0, j = 1,p and g_i(x) ≤ 0, i = 1,q
D: max_{λ ∈ IR^p, µ ∈ IR^q} Q(λ, µ) with µ_j ≥ 0, j = 1,q
Dual objective function: Q(λ, µ) = inf_x L(x, λ, µ) = inf_x [ J(x) + Σ_{j=1}^p λ_j h_j(x) + Σ_{i=1}^q µ_i g_i(x) ].
Wolfe dual problem W: max_{x, λ ∈ IR^p, µ ∈ IR^q} L(x, λ, µ) with µ_j ≥ 0, j = 1,q and ∇J(x) + Σ_{j=1}^p λ_j ∇h_j(x) + Σ_{i=1}^q µ_i ∇g_i(x) = 0.
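To make these definitions concrete, here is a tiny worked example (illustrative, not from the slides): the Lagrange dual of a one-dimensional QP.

\[
P:\ \min_{x\in\mathbb{R}} \tfrac{1}{2}x^2 \ \text{ with } 1-x\le 0,
\qquad L(x,\mu)=\tfrac{1}{2}x^2+\mu(1-x)
\]
\[
Q(\mu)=\inf_x L(x,\mu)=\mu-\tfrac{1}{2}\mu^2 \ \ (\text{attained at } x=\mu),
\qquad D:\ \max_{\mu\ge 0}\ \mu-\tfrac{1}{2}\mu^2 \ \Rightarrow\ \mu^\star=1,\ \ x^\star=\arg\min_x L(x,\mu^\star)=1,
\]

which is indeed the solution of P: the constraint is active and the multiplier µ* = 1 satisfies the KKT conditions.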

Duality theorems (2). Theorem (2.2, 2.3 and 2.4, Nocedal & Wright, p. 346): if J, h and g are convex and continuously differentiable, then (under some conditions, e.g. the linear independence constraint qualification) the solution of the dual problem is the same as the solution of the primal: (λ*, µ*) = solution of problem D, and x* = arg min_x L(x, λ*, µ*).

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

Linear SVM dual formulation - the Lagrangian. Primal: min_{w,b} 1/2 ‖w‖² with y_i(w⊤x_i + b) ≥ 1, i = 1,n. Looking for the Lagrangian saddle point min_{w,b} max_α L(w, b, α) with the so-called Lagrange multipliers α_i ≥ 0:
L(w, b, α) = 1/2 ‖w‖² − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 ).
α_i represents the influence of constraint i, and thus the influence of the training example (x_i, y_i).

Optimality conditions. L(w, b, α) = 1/2 ‖w‖² − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 ). Computing the gradients: ∇_w L(w, b, α) = w − Σ_{i=1}^n α_i y_i x_i and ∂L(w, b, α)/∂b = − Σ_{i=1}^n α_i y_i. We obtain the following optimality conditions: ∇_w L(w, b, α) = 0 ⟹ w = Σ_{i=1}^n α_i y_i x_i, and ∂L(w, b, α)/∂b = 0 ⟹ Σ_{i=1}^n α_i y_i = 0.

Linear SVM dual formulation. Optimality: w = Σ_{i=1}^n α_i y_i x_i and Σ_{i=1}^n α_i y_i = 0. Plugging these into L(w, b, α) = 1/2 ‖w‖² − Σ_i α_i ( y_i(w⊤x_i + b) − 1 ):
L(α) = 1/2 Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j⊤x_i − Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j⊤x_i − b Σ_{i=1}^n α_i y_i + Σ_{i=1}^n α_i = −1/2 Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_j⊤x_i + Σ_{i=1}^n α_i (the term in b vanishes since Σ_i α_i y_i = 0).
The dual of the linear SVM is also a quadratic program:
problem D: min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α with y⊤α = 0 and 0 ≤ α_i, i = 1,n,
where G is the n × n symmetric matrix with G_ij = y_i y_j x_j⊤x_i.
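For illustration (a sketch, not from the slides), the dual QP can also be handed to quadprog; assuming toy separable data X (n×d) and labels y (±1), and recovering w and b from the support vectors:

    n = size(X, 1);
    G = (y*y') .* (X*X');                    % G_ij = y_i y_j x_i'*x_j
    e = ones(n, 1);
    alpha = quadprog(G, -e, [], [], y', 0, zeros(n,1), []);   % y'*alpha = 0, alpha >= 0
    % (G is only positive semidefinite; a tiny ridge G + 1e-8*eye(n) can help numerically)
    w  = X' * (alpha .* y);                  % w = sum_i alpha_i y_i x_i
    sv = find(alpha > 1e-6);                 % support vectors: alpha_i > 0
    b  = mean(y(sv) - X(sv,:)*w);            % from y_i*(w'*x_i + b) = 1 on the SVs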

SVM primal vs. dual.
Primal: min_{w ∈ IR^d, b ∈ IR} 1/2 ‖w‖² with y_i(w⊤x_i + b) ≥ 1, i = 1,n; d + 1 unknowns, n constraints; a classical QP; perfect when d << n.
Dual: min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α with y⊤α = 0 and 0 ≤ α_i, i = 1,n; n unknowns, G the Gram matrix (pairwise influence matrix), n box constraints; easy to solve; to be used when d > n.
In both cases the decision function is f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x⊤x_i) + b.

Cold case: the least squares problem. Linear model: y_i = Σ_{j=1}^d w_j x_ij + ε_i, i = 1,n; n data and d variables, d < n. min_w ‖Xw − y‖² = min_w Σ_{i=1}^n ( Σ_{j=1}^d x_ij w_j − y_i )². Solution: w = (X⊤X)^{-1} X⊤y, so f(x) = x⊤(X⊤X)^{-1} X⊤y = x⊤w. What is the influence of each data point (each row of matrix X)? Shawe-Taylor & Cristianini's book, 2004.
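A small numerical sketch of this "variables vs. examples" view (illustrative, with made-up data; it only assumes X has full column rank):

    X = randn(20, 3);  y = randn(20, 1);     % n = 20 examples, d = 3 variables (toy data)
    w     = (X'*X) \ (X'*y);                 % primal view: d coefficients
    alpha = X * ((X'*X) \ w);                % dual view: n example weights, alpha = X (X'X)^{-1} w
    x_new = randn(3, 1);
    f_primal = x_new' * w;                   % f(x) = sum_j w_j x_j
    f_dual   = sum(alpha .* (X * x_new));    % f(x) = sum_i alpha_i (x' * x_i)
    % f_primal and f_dual coincide, and w = X' * alpha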

Data point influence (contribution). For any new data point x: f(x) = x⊤ (X⊤X)(X⊤X)^{-1} (X⊤X)^{-1} X⊤y = x⊤ X⊤ X(X⊤X)^{-1} (X⊤X)^{-1} X⊤y, i.e. f(x) = x⊤X⊤α with α = X(X⊤X)^{-1}w (n example weights) and w = X⊤α (d variable weights). Hence f(x) = Σ_{j=1}^d w_j x_j = Σ_{i=1}^n α_i (x⊤x_i): from variables to examples. What if d ≥ n!


Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

Solving the dual (1). Data point influence: α_i = 0, this point is useless; α_i ≠ 0, this point is said to be a support vector. f(x) = Σ_{j=1}^d w_j x_j + b = Σ_{i=1}^n α_i y_i (x⊤x_i) + b. In the example, the decision border only depends on 3 points (d + 1 of them).

Solving the dual (2). Assume we know which are these 3 support points. The dual min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α with y⊤α = 0 and 0 ≤ α_i, i = 1,n reduces to min_{α ∈ IR^3} 1/2 α⊤Gα − e⊤α with y⊤α = 0, whose Lagrangian is L(α, b) = 1/2 α⊤Gα − e⊤α + b y⊤α. Setting its gradients to zero gives the linear system Gα + b y = e, y⊤α = 0, which yields b = (y⊤G^{-1}e)/(y⊤G^{-1}y). In MATLAB, using a Cholesky factorization of G:
U = chol(G);               % upper triangular factor, G = U'*U
a = U \ (U' \ e);          % a = G^{-1} e
c = U \ (U' \ y);          % c = G^{-1} y
b = (y'*a) / (y'*c);       % from y'*alpha = 0 with alpha = a - b*c
alpha = U \ (U' \ (e - b*y));
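A quick way to sanity-check this closed form (an illustrative sketch; the three "support" points below are made up, and G is built as in the dual, G_ij = y_i y_j x_i'*x_j):

    Xs = [1 0 1; 0 1 1; 0 0 1];  ys = [1; 1; -1];   % three toy points assumed to be the SVs
    G  = (ys*ys') .* (Xs*Xs');                      % restricted Gram matrix (positive definite here)
    e  = ones(3, 1);
    U = chol(G);
    a = U \ (U' \ e);   c = U \ (U' \ ys);
    b = (ys'*a) / (ys'*c);
    alpha = U \ (U' \ (e - b*ys));
    w = Xs' * (alpha .* ys);                        % check: ys .* (Xs*w + b) is all ones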

Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in 5 slides Dual formulation of the linear SVM Solving the dual The non separable case

The non separable case: a bi-criteria optimization problem. Modeling potential errors: introduce slack variables ξ_i. For (x_i, y_i): no error, y_i(w⊤x_i + b) ≥ 1 ⟹ ξ_i = 0; error, ξ_i = 1 − y_i(w⊤x_i + b) > 0. Two objectives to minimize jointly:
min_{w,b,ξ} 1/2 ‖w‖² and min_{w,b,ξ} C/p Σ_{i=1}^n ξ_i^p with y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1,n.
Our hope: almost all ξ_i = 0.

The non separable case. Modeling potential errors: introduce slack variables ξ_i; no error, y_i(w⊤x_i + b) ≥ 1 ⟹ ξ_i = 0; error, ξ_i = 1 − y_i(w⊤x_i + b) > 0. Minimizing also the slack (the error), for a given C > 0:
min_{w,b,ξ} 1/2 ‖w‖² + C/p Σ_{i=1}^n ξ_i^p with y_i(w⊤x_i + b) ≥ 1 − ξ_i, i = 1,n and ξ_i ≥ 0, i = 1,n.
Looking for the saddle point of the Lagrangian, with Lagrange multipliers α_i ≥ 0 and β_i ≥ 0:
L(w, b, ξ, α, β) = 1/2 ‖w‖² + C/p Σ_{i=1}^n ξ_i^p − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i.

Optimality conditions (p = 1).
L(w, b, ξ, α, β) = 1/2 ‖w‖² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i.
Computing the gradients: ∇_w L = w − Σ_{i=1}^n α_i y_i x_i, ∂L/∂b = − Σ_{i=1}^n α_i y_i, ∂L/∂ξ_i = C − α_i − β_i.
No change for w and b; β_i ≥ 0 and C − α_i − β_i = 0 imply α_i ≤ C.
The dual formulation: min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α with y⊤α = 0 and 0 ≤ α_i ≤ C, i = 1,n.
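As with the hard-margin dual, this box-constrained QP goes straight into a generic solver; a minimal sketch (illustrative, assuming quadprog, data X, labels y and some value of C):

    n = size(X, 1);
    G = (y*y') .* (X*X');
    e = ones(n, 1);
    C = 10;                                   % trade-off parameter (illustrative value)
    alpha = quadprog(G, -e, [], [], y', 0, zeros(n,1), C*ones(n,1));   % 0 <= alpha_i <= C
    w  = X' * (alpha .* y);
    on_margin = find(alpha > 1e-6 & alpha < C - 1e-6);   % unbounded SVs: y_i*(w'*x_i + b) = 1
    b  = mean(y(on_margin) - X(on_margin,:)*w);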

SVM primal vs. dual (p = 1).
Primal: min_{w,b,ξ ∈ IR^n} 1/2 ‖w‖² + C Σ_{i=1}^n ξ_i with y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1,n; d + n + 1 unknowns, 2n constraints; a classical QP; to be used when n is too large to build G.
Dual: min_{α ∈ IR^n} 1/2 α⊤Gα − e⊤α with y⊤α = 0 and 0 ≤ α_i ≤ C, i = 1,n; n unknowns, G the Gram matrix (pairwise influence matrix), 2n box constraints; easy to solve; to be used when n is not too large.

Optimality conditions (p = 2).
L(w, b, ξ, α, β) = 1/2 ‖w‖² + C/2 Σ_{i=1}^n ξ_i² − Σ_{i=1}^n α_i ( y_i(w⊤x_i + b) − 1 + ξ_i ) − Σ_{i=1}^n β_i ξ_i.
Computing the gradients: ∇_w L = w − Σ_{i=1}^n α_i y_i x_i, ∂L/∂b = − Σ_{i=1}^n α_i y_i, ∂L/∂ξ_i = Cξ_i − α_i − β_i.
No change for w and b; Cξ_i − α_i − β_i = 0, and at the optimum C/2 Σ_i ξ_i² − Σ_i α_i ξ_i − Σ_i β_i ξ_i = −1/(2C) Σ_i α_i².
The dual formulation: min_{α ∈ IR^n} 1/2 α⊤(G + I/C)α − e⊤α with y⊤α = 0 and 0 ≤ α_i, i = 1,n.
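Concretely (an illustrative sketch under the same assumptions as before), the only change with respect to the p = 1 case is that the Gram matrix gets a ridge I/C and the upper bound on α disappears:

    n = size(X, 1);
    G = (y*y') .* (X*X');
    C = 10;                                   % illustrative value
    Greg = G + eye(n)/C;                      % regularized Gram matrix
    alpha = quadprog(Greg, -ones(n,1), [], [], y', 0, zeros(n,1), []);
    w  = X' * (alpha .* y);
    sv = find(alpha > 1e-6);
    b  = mean(y(sv) .* (1 - alpha(sv)/C) - X(sv,:)*w);   % y_i*(w'*x_i + b) = 1 - xi_i, xi_i = alpha_i/C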

SVM primal vs. dual (p = 2).
Primal: min_{w,b,ξ ∈ IR^n} 1/2 ‖w‖² + C/2 Σ_{i=1}^n ξ_i² with y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1,n; d + n + 1 unknowns, 2n constraints; a classical QP; to be used when n is too large to build G.
Dual: min_{α ∈ IR^n} 1/2 α⊤(G + I/C)α − e⊤α with y⊤α = 0 and 0 ≤ α_i, i = 1,n; n unknowns, the Gram matrix G is regularized, n box constraints; easy to solve; to be used when n is not too large.

Eliminating the slack, but not the possible mistakes.
min_{w,b,ξ ∈ IR^n} 1/2 ‖w‖² + C/2 Σ_{i=1}^n ξ_i² with y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1,n.
Introducing the hinge loss ξ_i = max(1 − y_i(w⊤x_i + b), 0), this becomes
min_{w,b} 1/2 ‖w‖² + C/2 Σ_{i=1}^n max(1 − y_i(w⊤x_i + b), 0)².
Back to d + 1 variables, but this is no longer an explicit QP.
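Since the squared hinge loss is differentiable, this d + 1 variable problem can be attacked with plain gradient descent; a rough illustrative sketch (fixed step size and iteration count chosen arbitrarily):

    [n, d] = size(X);
    C = 10;  w = zeros(d, 1);  b = 0;  step = 1e-3;
    for t = 1:5000
        m  = 1 - y .* (X*w + b);              % margins; the loss is active where m > 0
        a  = max(m, 0);
        gw = w - C * X' * (a .* y);           % gradient of 1/2*||w||^2 + C/2*sum(a_i^2)
        gb =   - C * sum(a .* y);
        w  = w - step * gw;
        b  = b - step * gb;
    end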

SVM (minimizing the hinge loss) with variable selection.
LP SVM: min_{w,b} ‖w‖_1 + C Σ_{i=1}^n max(1 − y_i(w⊤x_i + b), 0)
Lasso SVM: min_{w,b} ‖w‖_1 + C/2 Σ_{i=1}^n max(1 − y_i(w⊤x_i + b), 0)²
Elastic net SVM (doubly regularized, with p = 1 or p = 2): min_{w,b} ‖w‖_1 + 1/2 ‖w‖² + C Σ_{i=1}^n max(1 − y_i(w⊤x_i + b), 0)^p
Also: Dantzig selector for SVM, generalized garrote.
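The LP SVM really is a linear program once w is split into positive and negative parts (w = w+ − w−, so ‖w‖_1 = sum(w+) + sum(w−) at the optimum) and the hinge is written with slacks. An illustrative sketch using linprog, with the variable ordering z = [w+; w−; b; ξ] (X, y and the value of C are assumptions):

    [n, d] = size(X);  C = 10;                % illustrative value of C
    Dy = diag(y);
    f  = [ones(2*d,1); 0; C*ones(n,1)];       % objective: ||w||_1 + C*sum(xi)
    A  = [-Dy*X, Dy*X, -y, -eye(n)];          % -y_i*(w'*x_i + b) - xi_i <= -1
    rhs = -ones(n,1);
    lb = [zeros(2*d,1); -Inf; zeros(n,1)];    % w+, w- >= 0, b free, xi >= 0
    z  = linprog(f, A, rhs, [], [], lb, []);
    w  = z(1:d) - z(d+1:2*d);  b = z(2*d+1);  xi = z(2*d+2:end);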

Conclusion: variables or data points? Seeking a universal learning algorithm: no model for IP(x, y). The linear case: data is separable. The non separable case: a double objective, minimizing the error together with the regularity of the solution (multi objective optimisation). Duality: variables ↔ examples; use the primal when d < n (in the linear case) or when matrix G is hard to compute, otherwise use the dual. Universality = non linearity: kernels.