Support Vector Machine


Andrea Passerini (passerini@disi.unitn.it), Machine Learning

Support vector machines in a nutshell
- Linear classifiers selecting the hyperplane that maximizes the separation margin between classes (large margin classifiers)
- The solution depends only on a small subset of training examples (support vectors)
- Sound generalization theory (bounds on error based on the margin)
- Easily extended to nonlinear separation (kernel machines)

Maximum margin classifier

Maximum margin classifier: classifier margin
Given a training set $D$, the confidence margin of a classifier is:
$$\rho = \min_{(\boldsymbol{x},y) \in D} y f(\boldsymbol{x})$$
It is the minimal confidence margin (for predicting the true label) among training examples.
The geometric margin of a classifier is:
$$\frac{\rho}{\|\boldsymbol{w}\|} = \min_{(\boldsymbol{x},y) \in D} \frac{y f(\boldsymbol{x})}{\|\boldsymbol{w}\|}$$
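To make the definitions concrete, here is a small numpy sketch (the toy data and the classifier are illustrative assumptions, not part of the lecture) computing the confidence and geometric margins of a fixed linear classifier $f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + w_0$:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -3.0]])  # training examples
y = np.array([1, 1, -1, -1])                                        # labels in {-1, +1}
w, w0 = np.array([1.0, 1.0]), 0.0                                   # some linear classifier

f = X @ w + w0                        # f(x_i) for every training example
confidence_margin = np.min(y * f)     # rho = min_i y_i f(x_i)
geometric_margin = confidence_margin / np.linalg.norm(w)

print(confidence_margin, geometric_margin)
```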

Maximum margin classifier: canonical hyperplane
There is an infinite number of equivalent formulations for the same hyperplane:
$$\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$$
$$\alpha(\boldsymbol{w}^T\boldsymbol{x} + w_0) = 0 \quad \forall \alpha \neq 0$$
The canonical hyperplane is the hyperplane having confidence margin equal to 1:
$$\rho = \min_{(\boldsymbol{x},y) \in D} y f(\boldsymbol{x}) = 1$$
Its geometric margin is:
$$\frac{\rho}{\|\boldsymbol{w}\|} = \frac{1}{\|\boldsymbol{w}\|}$$

Maximum margin classifier

Hard margin SVM: Theorem (Margin Error Bound)
Consider the set of decision functions $f(\boldsymbol{x}) = \operatorname{sign}(\boldsymbol{w}^T\boldsymbol{x})$ with $\|\boldsymbol{w}\| \leq \Lambda$ and $\|\boldsymbol{x}\| \leq R$, for some $R, \Lambda > 0$. Moreover, let $\rho > 0$ and let $\nu$ denote the fraction of training examples with margin smaller than $\rho/\|\boldsymbol{w}\|$, referred to as the margin error.
For all distributions $P$ generating the data, with probability at least $1-\delta$ over the drawing of the $m$ training patterns, and for any $\rho > 0$ and $\delta \in (0,1)$, the probability that a test pattern drawn from $P$ will be misclassified is bounded from above by
$$\nu + \sqrt{\frac{c}{m}\left(\frac{R^2\Lambda^2}{\rho^2}\ln^2 m + \ln(1/\delta)\right)}.$$
Here, $c$ is a universal constant.

Hard margin SVM: Margin Error Bound, interpretation
$$\nu + \sqrt{\frac{c}{m}\left(\frac{R^2\Lambda^2}{\rho^2}\ln^2 m + \ln(1/\delta)\right)}$$
The probability of test error depends on (among other components):
- the number of margin errors $\nu$ (examples with margin smaller than $\rho/\|\boldsymbol{w}\|$)
- the number of training examples (the error depends on $\frac{\ln^2 m}{m}$)
- the size of the margin (the error depends on $1/\rho^2$)
Note: if $\rho$ is fixed to 1 (canonical hyperplane), maximizing the margin corresponds to minimizing $\|\boldsymbol{w}\|$.

Hard margin SVM: learning problem
$$\min_{\boldsymbol{w}, w_0} \frac{1}{2}\|\boldsymbol{w}\|^2$$
subject to:
$$y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) \geq 1 \quad \forall (\boldsymbol{x}_i, y_i) \in D$$
Note:
- the constraints guarantee that all points are correctly classified (plus canonical form)
- the minimization corresponds to maximizing the (squared) margin
- this is a quadratic optimization problem (the objective is quadratic, and the points satisfying the constraints form a convex set)

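As an illustration of the learning problem above, here is a minimal sketch (an illustrative assumption, not the lecture's code) solving the hard margin primal quadratic program with the cvxpy modelling library on a toy linearly separable dataset:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
w0 = cp.Variable()

# min 1/2 ||w||^2   subject to   y_i (w^T x_i + w0) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w  =", w.value)
print("w0 =", w0.value)
print("geometric margin =", 1.0 / np.linalg.norm(w.value))
```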

Digression: constrained optimization, the Karush-Kuhn-Tucker (KKT) approach
A constrained optimization problem can be addressed by converting it into an unconstrained problem with the same solution. Suppose we have a constrained optimization problem:
$$\min_{z} f(z)$$
subject to:
$$g_i(z) \geq 0 \quad \forall i$$
Introduce a non-negative variable $\alpha_i \geq 0$ (called Lagrange multiplier) for each constraint and rewrite the optimization problem as (Lagrangian):
$$\min_{z} \max_{\alpha \geq 0} f(z) - \sum_i \alpha_i g_i(z)$$

Digression: constrained optimization, the KKT approach (continued)
$$\min_{z} \max_{\alpha \geq 0} f(z) - \sum_i \alpha_i g_i(z)$$
The optimal solutions $z^*$ of this problem are the same as the optimal solutions of the original (constrained) problem:
- If for a given $z'$ at least one constraint is not satisfied, i.e. $g_i(z') < 0$ for some $i$, then maximizing over $\alpha_i$ leads to an infinite value, so $z'$ cannot be a minimum (unless the problem has no finite minimum at all).
- If all constraints are satisfied (i.e. $g_i(z') \geq 0$ for all $i$), maximization over the $\alpha_i$ sets all terms of the summation to zero, so that $z^*$ is a solution of $\min_{z} f(z)$.

Hard margin SVM: Karush-Kuhn-Tucker (KKT) approach
$$\min_{\boldsymbol{w}, w_0} \frac{1}{2}\|\boldsymbol{w}\|^2$$
subject to:
$$y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) \geq 1 \quad \forall (\boldsymbol{x}_i, y_i) \in D$$
The constraints can be included in the minimization using Lagrange multipliers $\alpha_i \geq 0$ (with $m = |D|$):
$$L(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1\right)$$
The Lagrangian is minimized with respect to $\boldsymbol{w}, w_0$ and maximized with respect to the $\alpha_i$ (the solution is a saddle point).

Hard margin SVM: dual formulation
$$L(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1\right)$$
Setting the derivatives with respect to the primal variables to zero we get:
$$\frac{\partial}{\partial w_0} L(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
$$\nabla_{\boldsymbol{w}} L(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = 0 \;\Rightarrow\; \boldsymbol{w} = \sum_{i=1}^{m} \alpha_i y_i \boldsymbol{x}_i$$

Hard margin SVM: dual formulation (continued)
Substituting into the Lagrangian we get:
$$\frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1\right) = \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j - \sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j - \underbrace{\sum_{i=1}^{m} \alpha_i y_i}_{=0} w_0 + \sum_{i=1}^{m} \alpha_i$$
$$= \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j = L(\boldsymbol{\alpha})$$
which is to be maximized with respect to the dual variables $\boldsymbol{\alpha}$.

Hard margin SVM: dual formulation (continued)
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^m} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j$$
subject to:
$$\alpha_i \geq 0 \quad i = 1,\dots,m$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0$$
This is the resulting maximization problem, with the constraints included. It is still a quadratic optimization problem.
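The dual is also a quadratic program and can be solved with the same tools. Below is a hedged sketch (same toy data, cvxpy again assumed) that solves the dual and recovers $\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i$; the quadratic term is written as $\frac{1}{2}\|\sum_i \alpha_i y_i \boldsymbol{x}_i\|^2$, which is the same quantity in a form the modelling library accepts directly:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

Z = y[:, None] * X          # row i is y_i x_i, so ||Z^T alpha||^2 = sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
alpha = cp.Variable(m)

objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X             # w = sum_i alpha_i y_i x_i
print("alpha =", a)
print("w     =", w)
```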

Hard margin SVM: note
- The dual formulation has simpler constraints (box constraints), and is easier to solve.
- The primal formulation has $d + 1$ variables (number of features plus one): $\min_{\boldsymbol{w}, w_0} \frac{1}{2}\|\boldsymbol{w}\|^2$
- The dual formulation has $m$ variables (number of training examples): $\max_{\boldsymbol{\alpha} \in \mathbb{R}^m} \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j$
- One can choose the primal formulation if it has far fewer variables (problem dependent).

Hard margin SVM: decision function
Substituting $\boldsymbol{w} = \sum_{i=1}^{m} \alpha_i y_i \boldsymbol{x}_i$ into the decision function we get:
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + w_0 = \sum_{i=1}^{m} \alpha_i y_i \boldsymbol{x}_i^T\boldsymbol{x} + w_0$$
- The decision function is a linear combination of dot products between training points and the test point.
- The dot product is a kind of similarity measure between points.
- The weights of the combination are $\alpha_i y_i$: a large $\alpha_i$ implies a large contribution towards class $y_i$ (times the similarity).

Hard margin SVM: Karush-Kuhn-Tucker (KKT) conditions
$$L(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1\right)$$
At the saddle point it holds that, for all $i$:
$$\alpha_i \left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1\right) = 0$$
Thus, either the example does not contribute to the final $f(\boldsymbol{x})$:
$$\alpha_i = 0$$
or the example lies on the minimal confidence hyperplane (confidence margin equal to 1 from the decision one):
$$y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) = 1$$

Hard margin SVM: support vectors
- Points lying on the minimal confidence hyperplanes are called support vectors.
- All other points do not contribute to the final decision function (i.e. they could be removed from the training set).
- SVMs are sparse, i.e. they typically have few support vectors.

Hard margin SVM: decision function bias
The bias $w_0$ can be computed from the KKT conditions. Given an arbitrary support vector $\boldsymbol{x}_i$ (with $\alpha_i > 0$), the KKT conditions imply:
$$y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) = 1$$
$$y_i\boldsymbol{w}^T\boldsymbol{x}_i + y_i w_0 = 1$$
$$w_0 = \frac{1 - y_i\boldsymbol{w}^T\boldsymbol{x}_i}{y_i}$$
For robustness, the bias is usually averaged over all support vectors.
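Continuing the dual sketch above (still an illustrative assumption), the support vectors, the averaged bias and the resulting decision function can be obtained as follows:

```python
# Requires a, y, X and w from the dual sketch above.
import numpy as np

sv = a > 1e-6                          # support vectors: alpha_i > 0 (up to numerical tolerance)
# w0 = (1 - y_i w^T x_i) / y_i = y_i - w^T x_i for labels in {-1, +1}, averaged over support vectors
w0 = np.mean(y[sv] - X[sv] @ w)

def f(x):
    """Decision function f(x) = sum_i alpha_i y_i x_i^T x + w0."""
    return np.sum(a * y * (X @ x)) + w0

print("support vectors:\n", X[sv])
print("f([2.5, 2.5]) =", f(np.array([2.5, 2.5])))
```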

Soft margin SVM

Soft margin SVM: slack variables
$$\min_{\boldsymbol{w} \in X,\, w_0 \in \mathbb{R},\, \boldsymbol{\xi} \in \mathbb{R}^m} \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i=1}^{m} \xi_i$$
subject to:
$$y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) \geq 1 - \xi_i \quad i = 1,\dots,m$$
$$\xi_i \geq 0 \quad i = 1,\dots,m$$
- A slack variable $\xi_i$ represents the penalty paid by example $\boldsymbol{x}_i$ for not satisfying the margin constraint.
- The sum of the slacks is minimized together with the inverse of the margin.
- The regularization parameter $C \geq 0$ trades off data fitting against the size of the margin.
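In practice this problem is rarely coded by hand; the svm module of scikit-learn listed in the references exposes it directly. A minimal sketch (the toy data and the value of C are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [-2.5, -1.5]])
y = np.array([1, 1, -1, -1, 1])     # the last point sits inside the negative cluster: slack is needed

clf = SVC(kernel="linear", C=1.0)   # smaller C allows more slack, i.e. a larger margin with more violations
clf.fit(X, y)

print("w  =", clf.coef_)            # primal weights (available for the linear kernel)
print("w0 =", clf.intercept_)
print("support vectors:\n", clf.support_vectors_)
```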

Soft margin SVM: regularization theory
$$\min_{\boldsymbol{w} \in X,\, w_0 \in \mathbb{R}} \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i=1}^{m} \ell(y_i, f(\boldsymbol{x}_i))$$
- This is a regularized loss minimization problem.
- The loss term accounts for error minimization.
- The margin maximization term accounts for regularization, i.e. solutions with a larger margin are preferred.
Note: regularization is a standard approach to prevent overfitting. It corresponds to a prior for simpler (more regular, smoother) solutions.

Soft margin SVM: hinge loss
$$\ell(y_i, f(\boldsymbol{x}_i)) = |1 - y_i f(\boldsymbol{x}_i)|_+ = |1 - y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)|_+$$
where $|z|_+ = z$ if $z > 0$ and $0$ otherwise (positive part).
- The hinge loss corresponds to the slack variable $\xi_i$ (the violation of the margin constraint).
- All examples not violating the margin constraint have zero loss (sparse set of support vectors).
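A one-line numpy version of the hinge loss (an assumed helper, shown only to make the positive-part notation concrete):

```python
import numpy as np

def hinge_loss(y, fx):
    """Element-wise hinge loss |1 - y f(x)|_+ for labels y in {-1, +1} and scores f(x)."""
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([1, -1, 1])
fx = np.array([2.0, -0.5, 0.3])
print(hinge_loss(y, fx))   # [0.  0.5 0.7]: only margin violations incur a loss
```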

Soft margin SVM: Lagrangian
$$L = C\sum_{i=1}^{m} \xi_i + \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i\left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{m} \beta_i\xi_i$$
where $\alpha_i \geq 0$ and $\beta_i \geq 0$.
Setting the derivatives with respect to the primal variables to zero we get:
$$\frac{\partial L}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
$$\nabla_{\boldsymbol{w}} L = 0 \;\Rightarrow\; \boldsymbol{w} = \sum_{i=1}^{m} \alpha_i y_i \boldsymbol{x}_i$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \beta_i = 0$$

Soft margin SVM: dual formulation
Substituting into the Lagrangian we get:
$$C\sum_{i=1}^{m} \xi_i + \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i\left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{m} \beta_i\xi_i$$
$$= \underbrace{\sum_{i=1}^{m} \xi_i\left(C - \alpha_i - \beta_i\right)}_{=0} + \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j - \sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j - \underbrace{\sum_{i=1}^{m} \alpha_i y_i}_{=0} w_0 + \sum_{i=1}^{m} \alpha_i$$
$$= \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j = L(\boldsymbol{\alpha})$$

Soft margin SVM: dual formulation (continued)
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^m} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j$$
subject to:
$$0 \leq \alpha_i \leq C \quad i = 1,\dots,m$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0$$
The box constraint on $\alpha_i$ comes from $C - \alpha_i - \beta_i = 0$ (together with the fact that both $\alpha_i \geq 0$ and $\beta_i \geq 0$).

Soft margin SVM: Karush-Kuhn-Tucker (KKT) conditions
$$L = C\sum_{i=1}^{m} \xi_i + \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_{i=1}^{m} \alpha_i\left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1 + \xi_i\right) - \sum_{i=1}^{m} \beta_i\xi_i$$
At the saddle point it holds that, for all $i$:
$$\alpha_i\left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1 + \xi_i\right) = 0$$
$$\beta_i\xi_i = 0$$
Thus, support vectors ($\alpha_i > 0$) are examples for which $y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) \leq 1$.

Soft margin SVM: support vectors
$$\alpha_i\left(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - 1 + \xi_i\right) = 0$$
$$\beta_i\xi_i = 0$$
- If $\alpha_i < C$, then $C - \alpha_i - \beta_i = 0$ and $\beta_i\xi_i = 0$ imply that $\xi_i = 0$. These are called unbound support vectors: $y_i(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) = 1$, i.e. they lie on the confidence-one hyperplane.
- If $\alpha_i = C$ (bound support vectors), then $\xi_i$ can be greater than zero, in which case the support vectors are margin errors.

Support vectors

Large-scale SVM learning: stochastic gradient descent
$$\min_{\boldsymbol{w} \in X} \frac{\lambda}{2}\|\boldsymbol{w}\|^2 + \frac{1}{m}\sum_{i=1}^{m} |1 - y_i\langle\boldsymbol{w}, \boldsymbol{x}_i\rangle|_+$$
Objective for a single example $(\boldsymbol{x}_i, y_i)$:
$$E(\boldsymbol{w}; (\boldsymbol{x}_i, y_i)) = \frac{\lambda}{2}\|\boldsymbol{w}\|^2 + |1 - y_i\langle\boldsymbol{w}, \boldsymbol{x}_i\rangle|_+$$
Subgradient:
$$\nabla_{\boldsymbol{w}} E(\boldsymbol{w}; (\boldsymbol{x}_i, y_i)) = \lambda\boldsymbol{w} - \mathbf{1}[y_i\langle\boldsymbol{w}, \boldsymbol{x}_i\rangle < 1]\; y_i\boldsymbol{x}_i$$

Large-scale SVM learning: note
The indicator function is:
$$\mathbf{1}[y_i\langle\boldsymbol{w}, \boldsymbol{x}_i\rangle < 1] = \begin{cases} 1 & \text{if } y_i\langle\boldsymbol{w}, \boldsymbol{x}_i\rangle < 1 \\ 0 & \text{otherwise} \end{cases}$$
The subgradient of a function $f$ at a point $\boldsymbol{x}_0$ is any vector $\boldsymbol{v}$ such that, for any $\boldsymbol{x}$:
$$f(\boldsymbol{x}) - f(\boldsymbol{x}_0) \geq \boldsymbol{v}^T(\boldsymbol{x} - \boldsymbol{x}_0)$$

Large-scale SVM learning: pseudocode (Pegasos)
1. Initialize $\boldsymbol{w}_1 = 0$
2. For $t = 1$ to $T$:
   1. Randomly choose $(\boldsymbol{x}_{i_t}, y_{i_t})$ from $D$
   2. Set $\eta_t = \frac{1}{\lambda t}$
   3. Update $\boldsymbol{w}$: $\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta_t \nabla_{\boldsymbol{w}} E(\boldsymbol{w}; (\boldsymbol{x}_{i_t}, y_{i_t}))$
3. Return $\boldsymbol{w}_{T+1}$
Note: this choice of learning rate allows the runtime required for an $\epsilon$-accurate solution to be bounded by $O(d/(\lambda\epsilon))$, with $d$ the maximum number of non-zero features in an example.
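A direct numpy transcription of the Pegasos pseudocode above (the toy data and the values of $\lambda$ and $T$ are arbitrary illustrative choices; the bias term is omitted, as in the objective above):

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Primal Pegasos: stochastic subgradient descent on the regularized hinge loss."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)                      # w_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(m)              # randomly choose (x_i, y_i) from D
        eta = 1.0 / (lam * t)            # learning rate eta_t = 1 / (lambda t)
        if y[i] * (w @ X[i]) < 1:        # margin violated: subgradient includes -y_i x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                            # margin satisfied: only the regularization term remains
            w = (1 - eta * lam) * w
    return w

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(pegasos(X, y))
```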

References
Bibliography:
- C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.
- S. Shalev-Shwartz et al., Pegasos: primal estimated sub-gradient solver for SVM, Mathematical Programming, 127(1), 3-30, 2011.
Software:
- svm module in scikit-learn: http://scikit-learn.org/stable/index.html
- libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- svmlight: http://svmlight.joachims.org/

Appendix: additional reference material

Large-scale SVM learning: dual version
It is easy to show that:
$$\boldsymbol{w}_{t+1} = \frac{1}{\lambda t}\sum_{t'=1}^{t} \mathbf{1}[y_{i_{t'}}\langle\boldsymbol{w}_{t'}, \boldsymbol{x}_{i_{t'}}\rangle < 1]\; y_{i_{t'}}\boldsymbol{x}_{i_{t'}}$$
We can therefore represent $\boldsymbol{w}_{t+1}$ implicitly by storing in a vector $\boldsymbol{\alpha}_{t+1}$ the number of times each example was selected and incurred a non-zero loss, i.e.:
$$\alpha_{t+1}[j] = \left|\{\,t' \leq t : i_{t'} = j \,\wedge\, y_j\langle\boldsymbol{w}_{t'}, \boldsymbol{x}_j\rangle < 1\,\}\right|$$

Large-scale SVM learning: pseudocode (Pegasos, dual version)
1. Initialize $\boldsymbol{\alpha}_1 = 0$
2. For $t = 1$ to $T$:
   1. Randomly choose $(\boldsymbol{x}_{i_t}, y_{i_t})$ from $D$
   2. Set $\boldsymbol{\alpha}_{t+1} = \boldsymbol{\alpha}_t$
   3. If $y_{i_t}\,\frac{1}{\lambda t}\sum_{j=1}^{m} \alpha_t[j]\, y_j\,\langle\boldsymbol{x}_j, \boldsymbol{x}_{i_t}\rangle < 1$, then $\alpha_{t+1}[i_t] = \alpha_{t+1}[i_t] + 1$
3. Return $\boldsymbol{\alpha}_{T+1}$
Note: this representation will be useful when combined with kernels.
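A numpy sketch of the dual (kernelized) version above, using a linear kernel matrix for concreteness (the toy data and the values of $\lambda$ and $T$ are arbitrary illustrative choices):

```python
import numpy as np

def pegasos_dual(K, y, lam=0.1, T=1000, seed=0):
    """Dual Pegasos: K is the m x m kernel matrix, y the labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)                  # alpha_1 = 0 (per-example selection counts)
    for t in range(1, T + 1):
        i = rng.integers(m)              # randomly choose (x_i, y_i) from D
        # implicit w_t applied to x_i: (1 / (lambda t)) sum_j alpha[j] y_j K(x_j, x_i)
        if y[i] * ((alpha * y) @ K[:, i]) / (lam * t) < 1:
            alpha[i] += 1
    return alpha

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                              # linear kernel; any kernel matrix could be used instead
print(pegasos_dual(K, y))
```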