Support Vector Machines for Classification and Regression


CIS 520: Machine Learning                                          Oct 04, 2017
Support Vector Machines for Classification and Regression
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Linearly separable data: Hard margin SVMs
- Non-linearly separable data: Soft margin SVMs
- Loss minimization view
- Support vector regression (SVR)

1 Linearly Separable Data: Hard Margin SVMs

In this lecture we consider linear support vector machines (SVMs); we will consider nonlinear extensions in the next lecture. Let X = R^d, and consider a binary classification task with Y = Ŷ = {±1}. A training sample S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^d × {±1})^m is said to be linearly separable if there exists a linear classifier h_{w,b}(x) = sign(w^T x + b) which classifies all examples in S correctly, i.e. for which y_i(w^T x_i + b) > 0 for all i ∈ [m]. For example, Figure 1 (left) shows a training sample in R^2 that is linearly separable, together with two possible linear classifiers that separate the data correctly (note that the decision surface of a linear classifier in 2 dimensions is a line, and more generally in d > 2 dimensions is a hyperplane). Which of the two classifiers is likely to give better generalization performance?

Figure 1: Left: A linearly separable data set, with two possible linear classifiers that separate the data. Blue circles represent class label +1 and red crosses -1; the arrow represents the direction of positive classification. Right: The same data set and classifiers, with the margin of separation shown.

Although both classifiers separate the data, the distance or margin with which the separation is achieved is different; this is shown in Figure 1 (right). For the rest of this section, assume that the training sample S = ((x_1, y_1), ..., (x_m, y_m)) is linearly separable; in this setting, the SVM algorithm selects the maximum margin linear classifier, i.e. the linear classifier that separates the training data with the largest margin.
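
As a quick illustration of the separability condition above (this example is not from the original notes), the following NumPy sketch checks whether a made-up toy sample is correctly separated by a hand-picked candidate classifier (w, b); both the data and the choice of (w, b) are hypothetical.

```python
import numpy as np

# Made-up toy sample in R^2 with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.0],       # class +1
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A hand-picked candidate linear classifier h_{w,b}(x) = sign(w^T x + b).
w = np.array([1.0, 1.0])
b = -0.5

# (w, b) separates S correctly iff y_i (w^T x_i + b) > 0 for every i in [m].
functional_margins = y * (X @ w + b)
print("separates S correctly:", bool(np.all(functional_margins > 0)))
```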

More precisely, define the (geometric) margin of a linear classifier h_{w,b}(x) = sign(w^T x + b) on an example (x_i, y_i) ∈ R^d × {±1} as

    \gamma_i = \frac{y_i (w^T x_i + b)}{\|w\|_2}.    (1)

Note that the distance of x_i from the hyperplane w^T x + b = 0 is given by |w^T x_i + b| / ||w||_2; therefore the above margin on (x_i, y_i) is simply a signed version of this distance, with a positive sign if the example is classified correctly and a negative sign otherwise. The (geometric) margin of h_{w,b} on the sample S = ((x_1, y_1), ..., (x_m, y_m)) is then defined as the minimal margin on examples in S:

    \gamma = \min_{i \in [m]} \gamma_i.    (2)

Given a linearly separable training sample S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^d × {±1})^m, the hard margin SVM algorithm finds a linear classifier that maximizes the above margin on S. In particular, any linear classifier that separates S correctly will have margin γ > 0; without loss of generality, we can represent any such classifier by some (w, b) such that

    \min_{i \in [m]} y_i (w^T x_i + b) = 1.    (3)

The margin of such a classifier on S then becomes simply

    \gamma = \min_{i \in [m]} \frac{y_i (w^T x_i + b)}{\|w\|_2} = \frac{1}{\|w\|_2}.    (4)

Thus, maximizing the margin becomes equivalent to minimizing the norm ||w||_2 subject to the constraints in Eq. (3), which can be written as the following optimization problem:

    \min_{w,b} \ \frac{1}{2} \|w\|_2^2    (5)
    \text{s.t.} \ y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, m.    (6)

This is a convex quadratic program (QP) and can in principle be solved directly. However it is useful to consider the dual of the above problem, which sheds light on the structure of the solution and also facilitates the extension to nonlinear classifiers which we will see in the next lecture. Note that by our assumption that the data is linearly separable, the above problem satisfies Slater's condition, and so strong duality holds. Therefore solving the dual problem is equivalent to solving the above primal problem. Introducing dual variables (or Lagrange multipliers) α_i ≥ 0 (i = 1, ..., m) for the inequality constraints above gives the Lagrangian function

    L(w, b, \alpha) = \frac{1}{2} \|w\|_2^2 + \sum_{i=1}^m \alpha_i \big( 1 - y_i (w^T x_i + b) \big).    (7)

The (Lagrange) dual function is then given by

    \phi(\alpha) = \inf_{w \in R^d, \, b \in R} L(w, b, \alpha).

To compute the dual function, we set the derivatives of L(w, b, α) w.r.t. w and b to zero; this gives the following:

    w = \sum_{i=1}^m \alpha_i y_i x_i    (8)
    \sum_{i=1}^m \alpha_i y_i = 0.    (9)
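
To make the primal problem in Eqs. (5)-(6) concrete (this example is not from the original notes), here is a minimal sketch that solves the hard-margin QP on the toy data from the previous sketch, assuming the cvxpy package is available; the data and variable names are hypothetical.

```python
import numpy as np
import cvxpy as cp

# Same made-up linearly separable toy data as in the previous sketch.
X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m, d = X.shape

# Hard-margin primal, Eqs. (5)-(6): minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1.
w = cp.Variable(d)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

w_hat, b_hat = w.value, b.value
print("w_hat =", w_hat, " b_hat =", b_hat)
print("achieved margin 1/||w_hat||_2 =", 1.0 / np.linalg.norm(w_hat))  # Eq. (4)
```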

Substituting these back into L(w, b, α), we have the following dual function:

    \phi(\alpha) = -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_{i=1}^m \alpha_i ;

this dual function is defined over the domain { α ∈ R^m : Σ_{i=1}^m α_i y_i = 0 }. This leads to the following dual problem:

    \max_{\alpha} \ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_{i=1}^m \alpha_i    (10)
    \text{s.t.} \ \sum_{i=1}^m \alpha_i y_i = 0    (11)
    \alpha_i \geq 0, \quad i = 1, \dots, m.    (12)

This is again a convex QP (in the m variables α_i) and can be solved efficiently using numerical optimization methods. On obtaining the solution α̂ to the above dual problem, the weight vector ŵ corresponding to the maximal margin classifier can be obtained via Eq. (8):

    \hat{w} = \sum_{i=1}^m \hat{\alpha}_i y_i x_i.

Now, by the complementary slackness condition in the KKT conditions, we have for each i ∈ [m],

    \hat{\alpha}_i \big( 1 - y_i (\hat{w}^T x_i + \hat{b}) \big) = 0.

This gives

    \hat{\alpha}_i > 0 \implies 1 - y_i (\hat{w}^T x_i + \hat{b}) = 0.

In other words, α̂_i is positive only for training points x_i that lie on the margin, i.e. that are closest to the separating hyperplane; these points are called the support vectors. For all other training points x_i, we have α̂_i = 0. Thus the solution for ŵ can be written as a linear combination of just the support vectors; specifically, if we define SV = { i ∈ [m] : α̂_i > 0 }, then we have

    \hat{w} = \sum_{i \in SV} \hat{\alpha}_i y_i x_i.

Moreover, for all i ∈ SV, we have

    y_i (\hat{w}^T x_i + \hat{b}) = 1, \quad \text{or equivalently} \quad \hat{b} = y_i - \hat{w}^T x_i.

This allows us to obtain b̂ from any of the support vectors; in practice, for numerical stability, one generally averages over all the support vectors, giving

    \hat{b} = \frac{1}{|SV|} \sum_{i \in SV} (y_i - \hat{w}^T x_i).

In order to classify a new point x ∈ R^d using the learned classifier, one then computes

    h_{\hat{w},\hat{b}}(x) = \text{sign}(\hat{w}^T x + \hat{b}) = \text{sign}\Big( \sum_{i \in SV} \hat{\alpha}_i y_i (x_i^T x) + \hat{b} \Big).    (13)
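
The dual QP in Eqs. (10)-(12) can also be solved directly; the sketch below (again not from the notes, and again assuming cvxpy) does so on the same made-up toy data, then recovers ŵ, b̂, and the support vectors exactly as described above.

```python
import numpy as np
import cvxpy as cp

# Same made-up toy data as before.
X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

# Dual problem, Eqs. (10)-(12). Writing G with rows y_i x_i, the quadratic term
# sum_ij alpha_i alpha_j y_i y_j (x_i^T x_j) equals ||G^T alpha||_2^2.
G = y[:, None] * X
alpha = cp.Variable(m)
problem = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha)),
                     [alpha >= 0, y @ alpha == 0])
problem.solve()

a = alpha.value
sv = np.where(a > 1e-6)[0]                    # support vectors: alpha_i > 0
w_hat = (a[sv] * y[sv]) @ X[sv]               # w_hat = sum_{i in SV} alpha_i y_i x_i
b_hat = np.mean(y[sv] - X[sv] @ w_hat)        # average of y_i - w_hat^T x_i over SV
print("support vector indices:", sv, " w_hat =", w_hat, " b_hat =", b_hat)

# Classify a new point x via Eq. (13).
x_new = np.array([1.0, 0.5])
print("prediction:", np.sign((a[sv] * y[sv]) @ (X[sv] @ x_new) + b_hat))
```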

2 Non-Linearly Separable Data: Soft Margin SVMs

The above derivation assumed the existence of a linear classifier that can correctly classify all examples in a given training sample S = ((x_1, y_1), ..., (x_m, y_m)). But what if the sample is not linearly separable? In this case, one needs to allow for the possibility of errors in classification. This is usually done by relaxing the constraints in Eq. (6) through the introduction of slack variables ξ_i ≥ 0 (i = 1, ..., m), and requiring only that

    y_i (w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \dots, m.    (14)

An extra cost for errors can be assigned as follows:

    \min_{w,b,\xi} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^m \xi_i    (15)
    \text{s.t.} \ y_i (w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \dots, m    (16)
    \xi_i \geq 0, \quad i = 1, \dots, m.    (17)

Thus, whenever y_i(w^T x_i + b) < 1, we pay an associated cost of Cξ_i = C(1 - y_i(w^T x_i + b)) in the objective function; a classification error occurs when y_i(w^T x_i + b) ≤ 0, or equivalently when ξ_i ≥ 1. The parameter C > 0 controls the tradeoff between increasing the margin (minimizing ||w||_2) and reducing the errors (minimizing Σ_i ξ_i): a large value of C keeps the errors small at the cost of a reduced margin; a small value of C allows for more errors while increasing the margin on the remaining examples.

Forming the dual of the above problem as before leads to the same convex QP as in the linearly separable case, except that the constraints in Eq. (12) are replaced by [1]

    0 \leq \alpha_i \leq C, \quad i = 1, \dots, m.    (18)

The solution for ŵ is obtained similarly to the linearly separable case:

    \hat{w} = \sum_{i=1}^m \hat{\alpha}_i y_i x_i.

In this case, the complementary slackness conditions yield for each i ∈ [m]: [2]

    \hat{\alpha}_i \big( 1 - \hat{\xi}_i - y_i (\hat{w}^T x_i + \hat{b}) \big) = 0
    (C - \hat{\alpha}_i) \hat{\xi}_i = 0.

This gives

    \hat{\alpha}_i > 0 \implies 1 - \hat{\xi}_i - y_i (\hat{w}^T x_i + \hat{b}) = 0
    \hat{\alpha}_i < C \implies \hat{\xi}_i = 0.

In particular, this gives

    0 < \hat{\alpha}_i < C \implies y_i (\hat{w}^T x_i + \hat{b}) = 1 ;

these are the points on the margin.

[1] To see this, note that in this case there are 2m dual variables, say {α_i} for the first set of inequality constraints and {β_i} for the second set of inequality constraints ξ_i ≥ 0. When setting the derivative of the Lagrangian L(w, b, ξ, α, β) w.r.t. ξ_i to zero, one gets α_i + β_i = C, allowing one to replace β_i with C − α_i throughout; the constraint β_i ≥ 0 then becomes α_i ≤ C.
[2] Again, the second set of complementary slackness conditions here are obtained by replacing the dual variables β_i (for the inequality constraints ξ_i ≥ 0) with C − α̂_i throughout; see also note [1].
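
As a brief aside (this example is not from the original notes), the following sketch uses scikit-learn's linear SVC, which solves the soft margin problem in Eqs. (15)-(17), to show how C trades off margin size against total slack; the data set is made up and deliberately not linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data that is NOT linearly separable (one overlapping point per class).
X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 1.0], [-0.5, -0.5],
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0], [1.0, 1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)                # soft margin SVM, Eqs. (15)-(17)
    w_hat, b_hat = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w_hat + b_hat))   # xi_i = (1 - y_i(w^T x_i + b))_+
    print(f"C={C:>5}: margin={1.0 / np.linalg.norm(w_hat):.3f}, "
          f"total slack={slack.sum():.3f}, #SVs={len(clf.support_)}")
```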

From the complementary slackness conditions above, we thus have three types of support vectors with α̂_i > 0 (see Figure 2):

    SV_1 = { i ∈ [m] : 0 < α̂_i < C }
    SV_2 = { i ∈ [m] : α̂_i = C, ξ̂_i < 1 }
    SV_3 = { i ∈ [m] : α̂_i = C, ξ̂_i ≥ 1 }.

SV_1 contains margin support vectors (ξ̂_i = 0; these lie on the margin and are correctly classified); SV_2 contains non-margin support vectors with 0 < ξ̂_i < 1 (these are correctly classified, but lie within the margin); SV_3 contains non-margin support vectors with ξ̂_i ≥ 1 (these correspond to classification errors). Let SV = SV_1 ∪ SV_2 ∪ SV_3.

Figure 2: Three types of support vectors in the non-separable case.

Then we have

    \hat{w} = \sum_{i \in SV} \hat{\alpha}_i y_i x_i.

Moreover, we can use the margin support vectors in SV_1 to compute b̂:

    \hat{b} = \frac{1}{|SV_1|} \sum_{i \in SV_1} (y_i - \hat{w}^T x_i).

The above formulation of the SVM algorithm for the general (non-separable) case is often called the soft margin SVM.

3 Loss Minimization View

An alternative motivation for the (soft margin) SVM algorithm is in terms of minimizing the hinge loss on the training sample S = ((x_1, y_1), ..., (x_m, y_m)). Specifically, define ℓ_hinge : {±1} × R → R_+ as

    \ell_{\text{hinge}}(y, f) = (1 - yf)_+,    (19)

where (z)_+ = max(0, z). This loss is convex in f and upper bounds the 0-1 loss much as the logistic loss does. Now consider learning a linear classifier that minimizes the empirical hinge loss, plus an L_2 regularization term:

    \min_{w,b} \ \frac{1}{m} \sum_{i=1}^m \big( 1 - y_i (w^T x_i + b) \big)_+ + \lambda \|w\|_2^2.    (20)

Introducing slack variables ξ_i (i = 1, ..., m), we can re-write this as

    \min_{w,b,\xi} \ \frac{1}{m} \sum_{i=1}^m \xi_i + \lambda \|w\|_2^2    (21)
    \text{s.t.} \ \xi_i \geq 1 - y_i (w^T x_i + b), \quad i = 1, \dots, m    (22)
    \xi_i \geq 0, \quad i = 1, \dots, m.    (23)

This is equivalent to the soft margin SVM (with C = 1/(2λm)); in other words, the soft margin SVM algorithm derived earlier effectively performs L_2-regularized empirical hinge loss minimization (with λ = 1/(2Cm))!
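
The equivalence above suggests training by direct loss minimization. The following rough sketch (not from the notes) approximately minimizes the objective in Eq. (20) by subgradient descent on made-up data; the step size, iteration count, and λ are arbitrary choices, and the result is only an approximate minimizer.

```python
import numpy as np

# Made-up 2-D data: two noisy Gaussian blobs, labels in {+1, -1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (20, 2)), rng.normal(-1.5, 1.0, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
m, d = X.shape

lam = 0.01                 # regularization parameter lambda in Eq. (20)
eta = 0.05                 # fixed step size (arbitrary choice for this sketch)
w, b = np.zeros(d), 0.0
for _ in range(5000):
    active = y * (X @ w + b) < 1                      # examples with nonzero hinge loss
    # Subgradient of (1/m) sum_i (1 - y_i(w^T x_i + b))_+ + lam ||w||_2^2.
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / m + 2 * lam * w
    grad_b = -y[active].sum() / m
    w -= eta * grad_w
    b -= eta * grad_b

obj = np.maximum(0.0, 1.0 - y * (X @ w + b)).mean() + lam * w @ w
print("objective:", round(obj, 4), "| equivalent soft-margin C = 1/(2*lam*m) =", 1 / (2 * lam * m))
```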

4 Support Vector Regression (SVR)

Consider now a regression problem with X = R^d and Y = Ŷ = R. Given a training sample S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (R^d × R)^m, the support vector regression (SVR) algorithm minimizes an L_2-regularized form of the ε-insensitive loss ℓ_ε : R × R → R_+, defined as

    \ell_\epsilon(y, \hat{y}) = ( |\hat{y} - y| - \epsilon )_+    (24)
                              = \begin{cases} 0 & \text{if } |\hat{y} - y| \leq \epsilon \\ |\hat{y} - y| - \epsilon & \text{otherwise}. \end{cases}    (25)

This yields

    \min_{w,b} \ \frac{1}{m} \sum_{i=1}^m \big( |(w^T x_i + b) - y_i| - \epsilon \big)_+ + \lambda \|w\|_2^2.    (26)

Introducing slack variables ξ_i, ξ_i^* (i = 1, ..., m) and writing λ = 1/(2Cm) for appropriate C > 0, we can re-write this as

    \min_{w,b,\xi,\xi^*} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)    (27)
    \text{s.t.} \ \xi_i \geq y_i - (w^T x_i + b) - \epsilon, \quad i = 1, \dots, m    (28)
    \xi_i^* \geq (w^T x_i + b) - y_i - \epsilon, \quad i = 1, \dots, m    (29)
    \xi_i, \xi_i^* \geq 0, \quad i = 1, \dots, m.    (30)

This is again a convex QP that can in principle be solved directly; again, it is useful to consider the dual, which helps to understand the structure of the solution and facilitates the extension to nonlinear SVR. We leave the details as an exercise; the resulting dual problem has the following form:

    \max_{\alpha,\alpha^*} \ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(x_i^T x_j) + \sum_{i=1}^m y_i (\alpha_i - \alpha_i^*) - \epsilon \sum_{i=1}^m (\alpha_i + \alpha_i^*)    (31)
    \text{s.t.} \ \sum_{i=1}^m (\alpha_i - \alpha_i^*) = 0    (32)
    0 \leq \alpha_i \leq C, \quad i = 1, \dots, m    (33)
    0 \leq \alpha_i^* \leq C, \quad i = 1, \dots, m.    (34)

This is again a convex QP (in the 2m variables α_i, α_i^*); the solution α̂, α̂^* can be used to find the solution ŵ to the primal problem as follows:

    \hat{w} = \sum_{i=1}^m (\hat{\alpha}_i - \hat{\alpha}_i^*) x_i.

In this case, the complementary slackness conditions yield for each i ∈ [m]:

    \hat{\alpha}_i \big( \hat{\xi}_i - y_i + (\hat{w}^T x_i + \hat{b}) + \epsilon \big) = 0
    \hat{\alpha}_i^* \big( \hat{\xi}_i^* + y_i - (\hat{w}^T x_i + \hat{b}) + \epsilon \big) = 0
    (C - \hat{\alpha}_i) \hat{\xi}_i = 0
    (C - \hat{\alpha}_i^*) \hat{\xi}_i^* = 0.
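
As a concrete illustration of the ε-insensitive loss and of the primal problem in Eqs. (27)-(30) (not part of the original notes), the sketch below fits a linear SVR model with scikit-learn on made-up one-dimensional data; the constants C = 10 and ε = 0.5 are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y, y_hat, eps):
    """epsilon-insensitive loss of Eqs. (24)-(25): (|y_hat - y| - eps)_+."""
    return np.maximum(0.0, np.abs(y_hat - y) - eps)

# Made-up 1-D regression data: y ~ 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.3, 50)

# Linear SVR solves the QP in Eqs. (27)-(30); epsilon is the tube width, C the slack penalty.
svr = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)
w_hat, b_hat = svr.coef_[0], svr.intercept_[0]
print("w_hat =", w_hat, " b_hat =", b_hat)

# Points strictly inside the epsilon-tube incur zero loss and are not support vectors.
y_pred = svr.predict(X)
print("mean eps-insensitive loss:", eps_insensitive_loss(y, y_pred, 0.5).mean())
print("#support vectors:", len(svr.support_), "of", len(y))
```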

Analysis of these conditions shows that for each i, either α̂_i or α̂_i^* (or both) must be zero. For points inside the ε-tube around the learned linear function, i.e. for which |(ŵ^T x_i + b̂) - y_i| < ε, we have both α̂_i = α̂_i^* = 0. The remaining points constitute two types of support vectors:

    SV_1 = { i ∈ [m] : 0 < α̂_i < C or 0 < α̂_i^* < C }
    SV_2 = { i ∈ [m] : α̂_i = C or α̂_i^* = C }.

SV_1 contains support vectors on the tube boundary (with ξ̂_i = ξ̂_i^* = 0); SV_2 contains support vectors outside the tube (with ξ̂_i > 0 or ξ̂_i^* > 0). Taking SV = SV_1 ∪ SV_2, we then have

    \hat{w} = \sum_{i \in SV} (\hat{\alpha}_i - \hat{\alpha}_i^*) x_i.

As before, the boundary support vectors in SV_1 can be used to compute b̂, which gives

    \hat{b} = \frac{1}{|SV_1|} \Big( \sum_{i : 0 < \hat{\alpha}_i < C} (y_i - \hat{w}^T x_i - \epsilon) + \sum_{i : 0 < \hat{\alpha}_i^* < C} (y_i - \hat{w}^T x_i + \epsilon) \Big).

The prediction for a new point x ∈ R^d is then made via

    f_{\hat{w},\hat{b}}(x) = \hat{w}^T x + \hat{b} = \sum_{i \in SV} (\hat{\alpha}_i - \hat{\alpha}_i^*) (x_i^T x) + \hat{b}.

In practice, the parameter C in SVM and the parameters C and ε in SVR are generally selected by cross-validation on the training sample (or using a separate validation set); a short illustration of this is given after the exercises below. An alternative parametrization of the SVM and SVR optimization problems, termed ν-SVM and ν-SVR, makes use of a different parameter ν that directly bounds the fraction of training examples that end up as support vectors.

Exercise. Derive the dual of the SVR optimization problem above.

Exercise. Derive an alternative formulation of the SVR optimization problem that makes use of a single slack variable ξ_i for each data point rather than two slack variables ξ_i, ξ_i^*. Show that this leads to the same solution as above.

Exercise. Derive a regression algorithm that, given a training sample S, minimizes on S the L_2-regularized absolute loss ℓ_abs : R × R → R_+, given by ℓ_abs(y, ŷ) = |ŷ - y|, over all linear functions.
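
As noted before the exercises, C and ε are typically chosen by cross-validation. A minimal sketch (not from the notes, assuming scikit-learn is installed) that grid-searches both parameters for linear SVR on made-up data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Made-up regression data, as in the earlier SVR sketch.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.3, 80)

# 5-fold cross-validation over a small, arbitrary grid of C and epsilon values.
param_grid = {"C": [0.1, 1.0, 10.0, 100.0], "epsilon": [0.01, 0.1, 0.5, 1.0]}
search = GridSearchCV(SVR(kernel="linear"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print("selected parameters:", search.best_params_)
```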