Support Vector Machines Explained

Similar documents
Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (continued)

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Non-linear Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines

Support Vector Machines

Jeff Howbert Introduction to Machine Learning Winter

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI

ML (cont.): SUPPORT VECTOR MACHINES

Lecture 10: A brief introduction to Support Vector Machine

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines.

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Review: Support vector machines. Machine learning techniques and image analysis

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Statistical Pattern Recognition

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Support Vector Machine & Its Applications

Linear, threshold units. Linear Discriminant Functions and Support Vector Machines. Biometrics CSE 190 Lecture 11. X i : inputs W i : weights

Lecture Notes on Support Vector Machine

Support Vector Machines for Classification and Regression

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Pattern Recognition 2018 Support Vector Machines

L5 Support Vector Classification

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Linear & nonlinear classifiers

Convex Optimization and Support Vector Machine

Linear Classification and SVM. Dr. Xin Zhang

Support Vector Machines

Lecture 10: Support Vector Machine and Large Margin Classifier

Linear & nonlinear classifiers

Support Vector Machine

Introduction to Support Vector Machines

Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Support Vector Machine for Classification and Regression

Support Vector Machines. Maximizing the Margin

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Soft-margin SVM can address linearly separable problems with outliers

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

Introduction to SVM and RVM

Support Vector Machines and Speaker Verification

CS798: Selected topics in Machine Learning

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Brief Introduction to Machine Learning

Lecture Support Vector Machine (SVM) Classifiers

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Support Vector Machines

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Kernel Methods and Support Vector Machines

Support Vector Machines

Support Vector Machines

Machine Learning. Support Vector Machines. Manfred Huber

Support Vector Machines

Data Mining - SVM. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - SVM 1 / 55

Announcements - Homework

Support Vector Machines and Kernel Algorithms

Support Vector Machine. Industrial AI Lab.

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Applied inductive learning - Lecture 7

Support Vector Machine

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Support Vector Machines

(Kernels +) Support Vector Machines

Support Vector Machines for Classification: A Statistical Portrait

Introduction to Support Vector Machines

MACHINE LEARNING. Support Vector Machines. Alessandro Moschitti

Learning with kernels and SVM

Soft-margin SVM can address linearly separable problems with outliers

Applied Machine Learning Annalisa Marsico

CS145: INTRODUCTION TO DATA MINING

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Introduction to Support Vector Machines

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37

Max Margin-Classifier

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines

SUPPORT VECTOR MACHINE

Polyhedral Computation. Linear Classifiers & the SVM

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

18.9 SUPPORT VECTOR MACHINES

Statistical Machine Learning from Data

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods

Modelli Lineari (Generalizzati) e SVM

Support Vector Machines with Example Dependent Costs

SVM optimization and Kernel methods

Formulation with slack variables

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

An introduction to Support Vector Machines

Transcription:

December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/

Introduction This document has been written in an attempt to make the Support Vector Machines (SVM), initially conceived of by Cortes and Vapnik [1], as simple to understand as possible for those with minimal experience of Machine Learning. It assumes basic mathematical knowledge in areas such as calculus, vector geometry and Lagrange multipliers. The document has been split into Theory and Application sections so that it is obvious, after the maths has been dealt with, how to actually apply the SVM for the different forms of problem that each section is centred on. The document s first section details the problem of classification for linearly separable data and introduces the concept of margin and the essence of SVM - margin maximization. The methodology of the SVM is then extended to data which is not fully linearly separable. This soft margin SVM introduces the idea of slack variables and the trade-off between maximizing the margin and minimizing the number of misclassified variables in the second section. The third section develops the concept of SVM further so that the technique can be used for regression. The fourth section explains the other salient feature of SVM - the Kernel Trick. It explains how incorporation of this mathematical sleight of hand allows SVM to classify and regress nonlinear data. Other than Cortes and Vapnik [1], most of this document is based on work by Cristianini and Shawe-Taylor [2], [3], Burges [4] and Bishop [5]. For any comments on or questions about this document, please contact the author through the URL on the title page. 1

1 Linearly Separable Binary Classification 1.1 Theory We have L training points, where each input x i has D attributes (i.e. is of dimensionality D) and is in one of two classes y i = -1 or +1, i.e our training data is of the form: {x i, y i } where i = 1... L, y i { 1, 1}, x R D Here we assume the data is linearly separable, meaning that we can draw a line on a graph of x 1 vs x 2 separating the two classes when D = 2 and a hyperplane on graphs of x 1, x 2... x D for when D > 2. This hyperplane can be described by w x + b = 0 where: w is normal to the hyperplane. b w is the perpendicular distance from the hyperplane to the origin. Support Vectors are the examples closest to the separating hyperplane and the aim of Support Vector Machines (SVM) is to orientate this hyperplane in such a way as to be as far as possible from the closest members of both classes. Figure 1: Hyperplane through two linearly separable classes Referring to Figure 1, implementing a SVM boils down to selecting the variables w and b so that our training data can be described by: x i w + b +1 for y i = +1 (1.1) x i w + b 1 for y i = 1 (1.2) These equations can be combined into: y i (x i w + b) 1 0 i (1.3) 2

If we now just consider the points that lie closest to the separating hyperplane, i.e. the Support Vectors (shown in circles in the diagram), then the two planes H 1 and H 2 that these points lie on can be described by: x i w + b = +1 for H 1 (1.4) x i w + b = 1 for H 2 (1.5) Referring to Figure 1, we define d 1 as being the distance from H 1 to the hyperplane and d 2 from H 2 to it. The hyperplane s equidistance from H 1 and H 2 means that d 1 = d 2 - a quantity known as the SVM s margin. In order to orientate the hyperplane to be as far from the Support Vectors as possible, we need to maximize this margin. 1 Simple vector geometry shows that the margin is equal to w and maximizing it subject to the constraint in (1.3) is equivalent to finding: min w such that y i (x i w + b) 1 0 i Minimizing w is equivalent to minimizing 1 2 w 2 and the use of this term makes it possible to perform Quadratic Programming (QP) optimization later on. We therefore need to find: min 1 2 w 2 s.t. y i (x i w + b) 1 0 i (1.6) In order to cater for the constraints in this minimization, we need to allocate them Lagrange multipliers α, where α i 0 i : L P 1 2 w 2 α [y i (x i w + b) 1 i ] (1.7) 1 2 w 2 1 2 w 2 α i [y i (x i w + b) 1] (1.8) α i y i (x i w + b) + α i (1.9) We wish to find the w and b which minimizes, and the α which maximizes (1.9) (whilst keeping α i 0 i ). We can do this by differentiating L P with respect to w and b and setting the derivatives to zero: L L P w = 0 w = α i y i x i (1.10) L P b L = 0 α i y i = 0 (1.11) 3

Substituting (1.10) and (1.11) into (1.9) gives a new formulation which, being dependent on α, we need to maximize: L D α i 1 α i α j y i y j x i x j s.t. α i 0 i, 2 i,j α i y i = 0 (1.12) α i 1 α i H ij α j where H ij y i y j x i x j (1.13) 2 i,j α i 1 2 αt Hα s.t. α i 0 i, α i y i = 0 (1.14) This new formulation L D is referred to as the Dual form of the Primary L P. It is worth noting that the Dual form requires only the dot product of each input vector x i to be calculated, this is important for the Kernel Trick described in the fourth section. Having moved from minimizing L P to maximizing L D, we need to find: max α [ L ] α i 1 2 αt Hα s.t. α i 0 i and α i y i = 0 (1.15) This is a convex quadratic optimization problem, and we run a QP solver which will return α and from (1.10) will give us w. What remains is to calculate b. Any data point satisfying (1.11) which is a Support Vector x s will have the form: y s (x s w + b) = 1 Substituting in (1.10): y s ( m S α m y m x m x s + b) = 1 Where S denotes the set of indices of the Support Vectors. S is determined by finding the indices i where α i > 0. Multiplying through by y s and then using y 2 s = 1 from (1.1) and (1.2): y 2 s( m S α m y m x m x s + b) = y s b = y s m S α m y m x m x s 4

Instead of using an arbitrary Support Vector x s, it is better to take an average over all of the Support Vectors in S: b = 1 N s s S(y s m S α m y m x m x s ) (1.16) We now have the variables w and b that define our separating hyperplane s optimal orientation and hence our Support Vector Machine. 5

1.2 Application In order to use an SVM to solve a linearly separable, binary classification problem we need to: Create H, where H ij = y i y j x i x j. Find α so that α i 1 2 αt Hα is maximized, subject to the constraints α i 0 i and α i y i = 0. This is done using a QP solver. Calculate w = α i y i x i. Determine the set of Support Vectors S by finding the indices such that α i > 0. Calculate b = 1 N s s s S(y α m y m x m x s ). m S Each new point x is classified by evaluating y = sgn(w x + b). 6

2 Binary Classification for Data that is not Fully Linearly Separable 2.1 Theory In order to extend the SVM methodology to handle data that is not fully linearly separable, we relax the constraints for (1.1) and (1.2) slightly to allow for misclassified points. This is done by introducing a positive slack variable ξ i, i = 1,... L : x i w + b +1 ξ i for y i = +1 (2.1) x i w + b 1 + ξ i for y i = 1 (2.2) ξ i 0 i (2.3) Which can be combined into: y i (x i w + b) 1 + ξ i 0 where ξ i 0 i (2.4) Figure 2: Hyperplane through two non-linearly separable classes In this soft margin SVM, data points on the incorrect side of the margin boundary have a penalty that increases with the distance from it. As we are trying to reduce the number of misclassifications, a sensible way to adapt our objective function (1.6) from previously, is to find: min 1 2 w 2 + C ξ i s.t. y i (x i w + b) 1 + ξ i 0 i (2.5) Where the parameter C controls the trade-off between the slack variable penalty and the size of the margin. Reformulating as a Lagrangian, which as before we need to minimize with respect to w, b and ξ i and maximize 7

with respect to α (where α i 0, µ i 0 i ): L P 1 2 w 2 + C ξ i α i [y i (x i w + b) 1 + ξ i ] µ i ξ i (2.6) Differentiating with respect to w, b and ξ i and setting the derivatives to zero: L L P w = 0 w = α i y i x i (2.7) L P b L = 0 α i y i = 0 (2.8) L P ξ i = 0 C = α i + µ i (2.9) Substituting these in, L D has the same form as (1.14) before. However (2.9) together with µ i 0 i, implies that α C. We therefore need to find: max α [ L ] α i 1 2 αt Hα s.t. 0 α i C i and α i y i = 0 (2.10) b is then calculated in the same way as in (1.6) before, though in this instance the set of Support Vectors used to calculate b is determined by finding the indices i where 0 < α i < C. 8

2.2 Application In order to use an SVM to solve a binary classification for data that is not fully linearly separable we need to: Create H, where H ij = y i y j x i x j. Choose how significantly misclassifications should be treated, by selecting a suitable value for the parameter C. Find α so that α i 1 2 αt Hα is maximized, subject to the constraints 0 α i C i and α i y i = 0. This is done using a QP solver. Calculate w = α i y i x i. Determine the set of Support Vectors S by finding the indices such that 0 < α i < C. Calculate b = 1 N s s s S(y α m y m x m x s ). m S Each new point x is classified by evaluating y = sgn(w x + b). 9

3 Support Vector Machines for Regression 3.1 Theory Instead of attempting to classify new unseen variables x into one of two categories y = ±1, we now wish to predict a real-valued output for y so that our training data is of the form: {x i, y i } where i = 1... L, y i R, x R D y i = w x i + b (3.1) Figure 3: Regression with ɛ-insensitive tube The regression SVM will use a more sophisticated penalty function than before, not allocating a penalty if the predicted value y i is less than a distance ɛ away from the actual value t i, i.e. if t i y i < ɛ. Referring to Figure 3, the region bound by y i ± ɛ i is called an ɛ-insensitive tube. The other modification to the penalty function is that output variables which are outside the tube are given one of two slack variable penalties depending on whether they lie above (ξ + ) or below (ξ ) the tube (where ξ + > 0, ξ > 0 i ): t i y i + ɛ + ξ + (3.2) t i y i ɛ ξ (3.3) The error function for SVM regression can then be written as: C (ξ i + + ξi ) + 1 2 w 2 (3.4) This needs to be minimized subject to the constraints ξ + 0, ξ 0 i and (3.2) and (3.3). In order to do this we introduce Lagrange multipliers 10

α + i 0, α i 0, µ + i 0 µ i 0 i : L L P = C (ξ i + +ξ i )+1 2 w 2 (µ + i ξ+ i +µ i ξ i ) α i + (ɛ+ξ+ i +y i t i ) αi (ɛ+ξ i y i+t i ) (3.5) Substituting for y i, differentiating with respect to w, b, ξ + and ξ and setting the derivatives to 0: L L P w = 0 w = i + αi )x i (3.6) L P b L P ξ + i L P ξ i L = 0 i + αi ) = 0 (3.7) = 0 C = α + i + µ + i (3.8) = 0 C = α i + µ i (3.9) Substituting (3.6) and (3.7) in, we now need to maximize L D with respect to α + i and α i + i 0, α i 0 i ) where: L D = i + α i )t i ɛ i + α i ) 1 i + 2 α i )+ j α j )x i x j (3.10) i,j Using µ + i 0 and µ i 0 together with (3.8) and (3.9) means that α i + C and αi C. We therefore need to find: max + α +,α i αi )t i ɛ i + αi ) 1 i + αi 2 )+ j α j )x i x j such that 0 α + i C, 0 α i C and i,j i + αi ) = 0 i. Substituting (3.6) into (3.1), new predictions y can be found using: y = (3.11) i + αi )x i x + b (3.12) A set S of Support Vectors x s can be created by finding the indices i where 0 < α < C and ξ + i = 0 (or ξ i = 0). 11

This gives us: b = t s ɛ m + αm)x m x s (3.13) m =S As before it is better to average over all the indices i in S: [ ] b = 1 t s ɛ m + α N m)x m x s s s S m =S (3.14) 12

3.2 Application In order to use an SVM to solve a regression problem we need to: Choose how significantly misclassifications should be treated and how large the insensitive loss region should be, by selecting suitable values for the parameters C and ɛ. Find α + and α so that: i + αi )t i ɛ i + αi ) 1 i + αi 2 )+ j α j )x i x j is maximized, subject to the constraints i,j 0 α + i C, 0 α i C and i + αi ) = 0 i. This is done using a QP solver. Calculate w = i + αi )x i. Determine the set of Support Vectors S by finding the indices i where 0 < α < C and ξ i = 0. Calculate [ b = 1 N s t i ɛ i + αi )x i x m ]. s S m=1 Each new point x is determined by evaluating y = i + αi )x i x + b. 13

4 Nonlinear Support Vector Machines 4.1 Theory When applying our SVM to linearly separable data we have started by creating a matrix H from the dot product of our input variables: H ij = y i y j k(x i, x j ) = x i x j = x T i x j (4.1) k(x i, x j ) is an example of a family of functions called Kernel Functions (k(x i, x j ) = x T i x j being known as a Linear Kernel). The set of kernel functions is composed of variants of (4.2) in that they are all based on calculating inner products of two vectors. This means that if the functions can be recast into a higher dimensionality space by some potentially non-linear feature mapping function x φ(x), only inner products of the mapped inputs in the feature space need be determined without us needing to explicitly calculate φ. The reason that this Kernel Trick is useful is that there are many classification/regression problems that are not linearly separable/regressable in the space of the inputs x, which might be in a higher dimensionality feature space given a suitable mapping x φ(x). Figure 4: Dichotomous data re-mapped using Radial Basis Kernel Refering to Figure 4, if we define our kernel to be: k(x i, x j ) = e ( x i x j 2 2σ 2 ) (4.2) then a data set that is not linearly separable in the two dimensional data space x (as in the left hand side of Figure 4 ) is separable in the nonlinear 14

feature space (right hand side of Figure 4 ) defined implicitly by this nonlinear kernel function - known as a Radial Basis Kernel. Other popular kernels for classification and regression are the Polynomial Kernel k(x i, x j ) = (x i x j + a) b and the Sigmoidal Kernel k(x i, x j ) = tanh(ax i x j b) where a and b are parameters defining the kernel s behaviour. There are many kernel functions, including ones that act upon sets, strings and even music. There are requirements for a function to be applicable as a kernel function that lie beyond the scope of this very brief introduction to the area. The author therefore recomends sticking with the three mentioned above to start with. 15

4.2 Application In order to use an SVM to solve a classification or regression problem on data that is not linearly separable, we need to first choose a kernel and relevant parameters which you expect might map the non-linearly separable data into a feature space where it is linearly separable. This is more of an art than an exact science and can be achieved empirically - e.g. by trial and error. Sensible kernels to start with are the Radial Basis, Polynomial and Sigmoidal kernels. The first step, therefore, consists of choosing our kernel and hence the mapping x φ(x). For classification, we would then need to: Create H, where H ij = y i y j φ(x i ) φ(x j ). Choose how significantly misclassifications should be treated, by selecting a suitable value for the parameter C. Find α so that α i 1 2 αt Hα is maximized, subject to the constraints 0 α i C i and α i y i = 0. This is done using a QP solver. Calculate w = α i y i φ(x i ). Determine the set of Support Vectors S by finding the indices such that 0 < α i < C. Calculate b = 1 N s s s S(y α m y m φ(x m ) φ(x s )). m S Each new point x is classified by evaluating y = sgn(w φ(x ) + b). For regression, we would then need to: Choose how significantly misclassifications should be treated and how large the insensitive loss region should be, by selecting suitable values for the parameters C and ɛ. 16

Find α + and α so that: i + α i )t i ɛ i + α i ) 1 i + 2 α i )+ j α j )φ(x i) φ(x j ) i,j is maximized, subject to the constraints 0 α + i C, 0 α i C and i + αi ) = 0 i. This is done using a QP solver. Calculate w = i + αi )φ(x i). Determine the set of Support Vectors S by finding the indices i where 0 < α < C and ξ i = 0. Calculate [ b = 1 N s t i ɛ s S ] i + αi )φ(x i) φ(x m ). m=1 Each new point x is determined by evaluating y = i + αi )φ(x i) φ(x ) + b. 17

References [1] C. Cortes, V. Vapnik, in Machine Learning, pp. 273 297 (1995). [2] N. Cristianini, J. Shawe-Taylor, An introduction to support Vector Machines: and other kernel-based learning methods. Cambridge University Press, New York, NY, USA (2000). [3] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA (2004). [4] C. J. C. Burges, Data Mining and Knowledge Discovery 2, 121 (1998). [5] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer (2006). 18