Support Vector Machines

EE 127/227AT: Optimization Models in Engineering                Section 11/12 - April 2014

Support Vector Machines

Lecturer: Arturo Fernandez                Scribe: Arturo Fernandez

1 Support Vector Machines Revisited

1.1 (Strictly) Separable Case: The Linear Hard-Margin Classifier

First, a note that much of the material included in these notes is based on the introduction in [1].

Consider a binary classification prediction problem. Training data are given, and we denote the $m$ observations as $T = \{X, y\} = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$ (accordingly $X \in \mathbb{R}^{m \times n}$). First, we consider the case where the two classes are linearly separable, i.e., separable by an $(n-1)$-dimensional decision boundary in the input space (the intersection of the input space with the $n$-dimensional decision hyperplane living in $\mathbb{R}^{n+1}$). Furthermore, since there can exist multiple hyperplanes that split the classes, we would like to find the one with the maximal margin. The motivation is that a maximal-margin decision boundary is least sensitive to variations in the classes, thus minimizing the expected risk, also sometimes referred to as the generalization error. Thus, we need a learning method that optimizes over the parameters $w = [w_1, \dots, w_n]^T \in \mathbb{R}^n$ and $b \in \mathbb{R}$ to find the optimal hyperplane, which in its general form is given by

$$ d(x, w, b) = w^T x + b = \sum_{i=1}^{n} w_i x_i + b \tag{1} $$

Furthermore, after finding this hyperplane, it is sensible for our decision rule to have the property that, for an unseen data point $x$:

(i) if $d(x, w, b) > 0$, classify $x$ as class 1 (i.e., its associated $y = +1$);
(ii) if $d(x, w, b) < 0$, classify $x$ as class 2 (i.e., its associated $y = -1$).

Thus, after finding the optimal parameters $(w^*, b^*)$, our decision rule takes the form

$$ \hat{f}(x) = \mathrm{sgn}\big(d(x, w^*, b^*)\big) = \mathrm{sgn}\big(w^{*T} x + b^*\big), \quad \text{where } \mathrm{sgn}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases} \tag{2} $$

The $(w^*, b^*)$ are the optimal solution parameters of an optimization problem that we will present in the next section. If a situation arises where a point $x$ evaluates to $w^{*T} x + b^* = 0$, then flip a coin or default to a specific class.

As mentioned earlier, the decision boundary is the intersection between the decision hyperplane and the input space $\mathbb{R}^n$; the boundary itself is given by $w^T x + b = 0$. Note that the decision boundary is unaffected by the scaling of the parameters $(w, b) \to (\alpha w, \alpha b)$ for $\alpha > 0$. Thus, consider the idea of a canonical hyperplane, a concept that is tied directly to the so-called support vectors (SVs). A canonical hyperplane satisfies the property that

$$ \min_{x_i \in X} \left| w^T x_i + b \right| = 1 \tag{3} $$

Alternatively, for a canonical hyperplane, $d(x_i, w, b)$ and $f(x_i)$ are the same and equal to $\pm 1$ for the support vectors, and $|d(x_i, w, b)| > |f(x_i)|$ for the other training points (non-support vectors).
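As a concrete illustration of (1)-(3), here is a minimal numpy sketch of the decision value, the decision rule, and the canonical-hyperplane check. The arrays `w`, `b`, and `X` are hypothetical placeholders for learned parameters and training data, not quantities defined in these notes.

```python
import numpy as np

def decision_value(x, w, b):
    # d(x, w, b) = w^T x + b, equation (1).
    return w @ x + b

def classify(x, w, b):
    # Decision rule (2): np.sign returns 0 exactly on the boundary, where the
    # notes suggest flipping a coin or defaulting to a specific class.
    return np.sign(decision_value(x, w, b))

def is_canonical(X, w, b, tol=1e-8):
    # Canonical-hyperplane property (3): the closest training point attains |w^T x_i + b| = 1.
    return abs(np.min(np.abs(X @ w + b)) - 1.0) < tol
```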

Figure 1: OCSH (picture from Wikipedia)

1.1.1 The Optimal Canonical Separating Hyperplane

We begin the derivation of the Optimal Canonical Separating Hyperplane (OCSH) by defining it as a canonical hyperplane that separates the data and has maximal margin. Thus, we need to solve an optimization problem that maximizes the margin $M = 2/\|w\|$ subject to constraints that ensure the classes are separated (Figure 1). This can be formulated as the convex optimization problem

$$ P1: \quad p^* := \min_{w, b}\ \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i \left[ w^T x_i + b \right] \ge 1, \quad i = 1, \dots, m \tag{4} $$

where the constraints were derived from our definition of a canonical hyperplane. What about the non-separable case? Section 1.2 considers the case when the two groups are not linearly separable.

We proceed by working with the problem in its dual space. First, the Lagrangian of (4) is

$$ L(w, b, \alpha) = \frac{1}{2} w^T w + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left[ w^T x_i + b \right] \right) \tag{5} $$

We know by the minimax inequality that

$$ p^* = \min_{w, b}\ \max_{\alpha \ge 0}\ L(w, b, \alpha) \ \ge\ \max_{\alpha \ge 0}\ \min_{w, b}\ L(w, b, \alpha) = d^* \tag{6} $$

But of interest to us is when strong duality holds. Indeed, it does hold by Slater's conditions, which are satisfied by our assumption of separability and the form of P1. Note that we can find the dual function $g(\alpha)$ by minimizing the Lagrangian over $(w, b)$ (an unconstrained convex optimization problem). Taking partial derivatives, we have that

$$ \frac{\partial L}{\partial w}(w, b, \alpha) = 0 \ \Longrightarrow\ w = \sum_i \alpha_i y_i x_i \tag{7} $$

$$ \frac{\partial L}{\partial b}(w, b, \alpha) = 0 \ \Longrightarrow\ \sum_i \alpha_i y_i = 0 \tag{8} $$
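As an aside, for modestly sized data the primal (4) can also be handed directly to a general-purpose convex solver. The sketch below assumes cvxpy is available and uses a synthetic, intended-to-be-separable dataset; both are assumptions of this sketch, not part of the notes.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two well-separated clusters in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((+2.0, +2.0), 0.5, (20, 2)),
               rng.normal((-2.0, -2.0), 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m, n = X.shape

# P1 (4): minimize (1/2) w^T w  subject to  y_i [w^T x_i + b] >= 1.
w = cp.Variable(n)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print("margin M = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```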

In addition, since strong duality holds, the complementary slackness (C-S) conditions are

$$ \alpha_i \left( 1 - y_i \left[ w^T x_i + b \right] \right) = 0, \quad i = 1, \dots, m \tag{9} $$

Substituting the results from (7) and (8) into the Lagrangian gives the dual function

$$ g(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j \tag{10} $$

and hence the dual problem

$$ D1: \quad d^* := \max_{\alpha \ge 0}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0 \tag{11} $$

$$ \phantom{D1: \quad d^*:} = \max_{\alpha}\ -\frac{1}{2} \alpha^T Q \alpha + \mathbf{1}^T \alpha \quad \text{s.t.} \quad \alpha^T y = 0, \quad \alpha \ge 0 \tag{12} $$

where $(Q)_{ij} = y_i y_j\, x_i^T x_j$. We can also write $Q = y y^T \circ X X^T$, where $\circ$ denotes the Hadamard product. Note that (12) is a quadratic program (QP), and for reasonably sized data it can be solved using a standard QP solver (e.g., CVX, Mosek, SeDuMi, etc.). Very large $m$ or $n$ often requires SVM-specific algorithms; well-known implementations include LIBSVM and LIBLINEAR. Once $\alpha^*$ has been found, the optimal primal variables $(w^*, b^*)$ are given by

$$ w^* = \sum_i \alpha_i^* y_i x_i \tag{13} $$

Let $I = \{ i : y_i [ w^{*T} x_i + b^* ] = 1 \}$. By the complementary slackness conditions (9), we have that

$$ b^* = \frac{1}{y_i} - w^{*T} x_i \quad \text{for any } i \in I \tag{14} $$

$$ \phantom{b^*} = \frac{1}{N_{SV}} \sum_{k \in I} \left( \frac{1}{y_k} - w^{*T} x_k \right) \tag{15} $$

where $N_{SV} = |I|$ is the number of support vectors (and $1/y_k = y_k$ since $y_k \in \{\pm 1\}$). The average is taken since the calculation of $b^*$ is numerically sensitive. Note that we have derived the OCSH parameters $(w^*, b^*)$ as a linear combination of the training data points, and that they are calculated using only the support vectors (SVs); this follows from the complementary slackness conditions. Finally, our optimal decision rule, based on the data and the OCSH, becomes

$$ f^*(x) = \mathrm{sgn}\left( w^{*T} x + b^* \right) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, x_i^T x + b^* \right) \tag{16} $$

However, it is naive to think that data can always be separated by a hyperplane, and thus an extension follows for overlapping classes. In Section 2, we consider non-linear boundaries.

1.2 Non-Separable Case: The Linear Soft-Margin Classifier

If the classes overlap, we will not be able to find a feasible $(w, b)$ pair that satisfies the class constraints in (4). Thus, we introduce the idea of a soft margin, which allows data points to lie within the margin. Idea: introduce slack variables $\xi_i$ ($i = 1, \dots, m$) into the constraints and penalize them in the objective. The new problem becomes

$$ P2: \quad \min_{w, b, \xi}\ \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \left[ w^T x_i + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m \tag{17} $$
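Before turning to the dual of P2, here is a minimal end-to-end sketch of the hard-margin pipeline (11)-(16) just derived. It assumes cvxpy/numpy are available, that the (hypothetical) data passed in are linearly separable, and uses a small tolerance to detect support vectors; none of these choices come from the notes. The quadratic term is written as $-\tfrac{1}{2}\|X^T(\alpha \circ y)\|^2$, which equals $-\tfrac{1}{2}\alpha^T Q \alpha$ and keeps the problem in DCP form.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm_dual(X, y, tol=1e-6):
    """Sketch of the hard-margin dual pipeline (11)-(16) for separable data."""
    m = X.shape[0]
    a = cp.Variable(m)
    # (12): max 1^T a - (1/2) a^T Q a, with a^T Q a = ||X^T (a * y)||^2.
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(a, y)))
    constraints = [a @ y == 0, a >= 0]
    cp.Problem(objective, constraints).solve()

    alpha = a.value
    w = X.T @ (alpha * y)                  # (13): w* = sum_i alpha_i y_i x_i
    sv = alpha > tol                       # support vectors have alpha_i > 0
    b = float(np.mean(y[sv] - X[sv] @ w))  # (14)-(15), averaged for numerical stability
    return alpha, w, b, sv
```

The decision rule (16) is then `np.sign(X_new @ w + b)` for new points `X_new`.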

I'll note that many generalizations of this form have been made, e.g., by exponentiating the $\xi_i$ or by replacing $\|w\|_2$ with $\|w\|_1$ and the like. We won't address those cases here. We return to solving P2 using the same duality ideas that we used in solving P1. First, the Lagrangian for P2 is

$$ L(w, b, \xi, \alpha, \gamma) = \frac{1}{2} w^T w + C \sum_i \xi_i + \sum_i \alpha_i \left( 1 - \xi_i - y_i \left[ w^T x_i + b \right] \right) - \sum_i \gamma_i \xi_i \tag{18} $$

First, we'll look at the dual problem by performing the inner minimization:

$$ \frac{\partial L}{\partial w}(w, b, \xi, \alpha, \gamma) = 0 \ \Longrightarrow\ w = \sum_i \alpha_i y_i x_i \tag{19} $$

$$ \frac{\partial L}{\partial b}(w, b, \xi, \alpha, \gamma) = 0 \ \Longrightarrow\ \sum_i \alpha_i y_i = 0 \tag{20} $$

$$ \frac{\partial L}{\partial \xi_i}(w, b, \xi, \alpha, \gamma) = 0 \ \Longrightarrow\ C = \alpha_i + \gamma_i, \quad i = 1, \dots, m \tag{21} $$

Does strong duality hold? Yes. To see this, pick any $(w, b)$ and let

$$ a = \min_{1 \le i \le m} y_i (w^T x_i + b) $$

Define $\xi_i = |1 - a|$ for all $i$; then feasibility in the refined (Slater) sense, which allows the affine constraints to be active, holds, and Slater's theorem implies strong duality. Again, we take a look at the C-S conditions:

$$ \alpha_i \left( 1 - \xi_i - y_i \left[ w^T x_i + b \right] \right) = 0, \quad i = 1, \dots, m \tag{22} $$

$$ \gamma_i \xi_i = (C - \alpha_i)\, \xi_i = 0, \quad i = 1, \dots, m \tag{23} $$

Plugging in (19), (20), and (21) gives us back the same dual function as before (10). However, (21) and $\alpha, \gamma \ge 0$ impose a new requirement on the dual problem, so that it becomes

$$ D2: \quad \max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m \tag{24} $$

$$ \phantom{D2:} \quad\ \ = \max_{\alpha}\ -\frac{1}{2} \alpha^T Q \alpha + \mathbf{1}^T \alpha \quad \text{s.t.} \quad \alpha^T y = 0, \quad 0 \le \alpha \le C \mathbf{1} \tag{25} $$

still remaining a QP. Up to this point, we have made no mention of what value $C$ takes and what it represents. The penalty parameter $C$ represents the trade-off between margin size and the number of misclassified points. For example, taking $C = \infty$ requires that all points be correctly classified (which may be infeasible). On the other hand, taking $C = 0$ tailors (17) to focus only on maximizing the margin. $C$ is thus usually chosen using cross-validation to minimize the estimated generalization error (a.k.a. empirical risk).
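To illustrate the cross-validation suggestion above, here is a hedged sketch using scikit-learn's SVC (a LIBSVM-based solver for the soft-margin dual) on synthetic overlapping data; the dataset and the candidate grid for $C$ are arbitrary choices of this sketch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (purely illustrative).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((+1.0, +1.0), 1.0, (50, 2)),
               rng.normal((-1.0, -1.0), 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# 5-fold cross-validation accuracy serves as the estimate of generalization error.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5)
search.fit(X, y)
print("selected C:", search.best_params_["C"], " CV accuracy:", search.best_score_)
```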

1.2.1 Support Vectors

After (25) is solved, the slackness conditions (22) and (23) imply three scenarios for a training data point $x_i$ and the Lagrange multiplier $\alpha_i$ associated with its classification constraint:

1. ($\alpha_i = 0$ and $\xi_i = 0$): the data point $x_i$ has been correctly classified and is not a support vector.

2. ($0 < \alpha_i < C$): by (21), $\gamma_i = C - \alpha_i > 0$, so (23) forces $\xi_i = 0$, and (22) then implies $y_i [ w^T x_i + b ] = 1$. Thus $x_i$ is a support vector. The support vectors that satisfy $0 < \alpha_i < C$ are the unbounded, or free, support vectors.

3. ($\alpha_i = C$): then by (22), $y_i [ w^T x_i + b ] = 1 - \xi_i$ with $\xi_i \ge 0$, and $x_i$ is a SV. The SVs with $\alpha_i = C$ are bounded support vectors; that is, they lie on or inside the margin. Furthermore, for $0 \le \xi_i < 1$, $x_i$ is correctly classified, but if $\xi_i \ge 1$, $x_i$ is misclassified.

(A small bookkeeping sketch of these three cases appears after Figure 2 below.)

2 The Nonlinear Classifier

2.1 Feature Space

Although the soft-margin linear classifier allowed us to consider linearly non-separable classes, its scope is still limited, since data might (and tend to) have nonlinear hypersurfaces that separate them. The idea here is to map our data $x \in \mathbb{R}^n$ into some high-dimensional feature space $F$ via a mapping $\Phi$ ($\Phi: \mathbb{R}^n \to \mathbb{R}^f$), and then solve the linear classification problem in that space. How does this affect (24)? We just replace the $x_i$'s with $\Phi_i = \Phi(x_i)$, but there is actually a better way to solve the problem without ever mapping the points explicitly. We will use the kernel trick to greatly reduce the number of necessary computations.

Figure 2: Nonlinear classifier (picture also from Wikipedia)
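Referring back to the three cases of Section 1.2.1, here is the small bookkeeping sketch promised there. The multipliers `alpha` are assumed to come from some soft-margin dual solver, and the tolerance is a numerical choice of this sketch.

```python
import numpy as np

def categorize_by_alpha(alpha, C, tol=1e-6):
    """Split training points by their dual multipliers, following cases 1-3 above."""
    alpha = np.asarray(alpha)
    non_sv     = alpha <= tol                        # case 1: alpha_i = 0
    free_sv    = (alpha > tol) & (alpha < C - tol)   # case 2: 0 < alpha_i < C, on the margin
    bounded_sv = alpha >= C - tol                    # case 3: alpha_i = C, on or inside the margin
    return non_sv, free_sv, bounded_sv

# Example: multipliers straddling the three cases for C = 1.
print(categorize_by_alpha([0.0, 0.3, 1.0], C=1.0))
```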

2.2 Nonlinear Hypersurfaces

If we want to create a hyperplane in the high-dimensional feature space $F$, we will need to be able to carry out inner products. However, depending on the dimension of $F$, this could become very costly, first in mapping the data and second in taking the inner products. Thus, we introduce the concept of the kernel function: a function which allows us to take the inner product of ($\Phi$-transformed) data without actually transforming the data. Huh? A kernel function satisfies

$$ K(x_i, x_j) = \Phi^T(x_i)\, \Phi(x_j) \tag{26} $$

The most popular kernels are:

Function $K(x, x_i)$ | Type of Classifier | Positive-Definite
$x^T x_i$ | Linear, dot product | Cond. PD
$(x^T x_i + b)^d$ | Polynomial of degree $d$ | PD
$\exp\{-\frac{1}{2}(x - x_i)^T \Sigma^{-1} (x - x_i)\}$ | Gaussian Radial Basis Function | PD
$\tanh(x^T x_i + b)$* | Multilayer Perceptron | Cond. PD
$(\|x - x_i\|^2 + \beta)^{-1/2}$ | Inverse Multiquadric Function | PD

* may not be a valid kernel for some $b$. $\Sigma$ is usually taken to be isotropic, i.e., $\Sigma = \sigma^2 I$.
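To make (26) and the table concrete, here is a short numpy sketch of three of the listed kernels, plus a check that the degree-2 polynomial kernel with $b = 1$ equals an inner product under an explicit feature map $\Phi$. The specific $\Phi$ shown is a standard construction used for illustration, not something defined in these notes.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, b=1.0, d=2):
    return (x @ z + b) ** d

def gaussian_rbf_kernel(x, z, sigma=1.0):
    # Isotropic case Sigma = sigma^2 I from the table's footnote.
    diff = x - z
    return np.exp(-0.5 * (diff @ diff) / sigma ** 2)

def gram_matrix(X, kernel):
    # (G)_ij = K(x_i, x_j); H_ij = y_i y_j G_ij is what enters the QP of the next subsection.
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def phi_quadratic(x):
    # One explicit feature map for the degree-2 polynomial kernel with b = 1 in R^2:
    # phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2).
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# K(x, z) = Phi^T(x) Phi(z), as in (26), computed without forming Phi in general.
assert np.isclose(polynomial_kernel(x, z), phi_quadratic(x) @ phi_quadratic(z))
```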

2.3 The Kernel Trick

After a feature space, and accordingly its kernel, has been selected, the linear classification problem with soft margins becomes

$$ \max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi_i^T \Phi_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m \tag{27} $$

$$ = \max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m \tag{28} $$

$$ = \max_{\alpha}\ -\frac{1}{2} \alpha^T H \alpha + \mathbf{1}^T \alpha \quad \text{s.t.} \quad \alpha^T y = 0, \quad 0 \le \alpha \le C \mathbf{1} \tag{29} $$

where $H_{ij} = y_i y_j K(x_i, x_j)$. Herein lies the importance of the Gram matrix $G$, where $(G)_{ij} = K(x_i, x_j)$: if $G$ is positive definite (PD), then the matrix $[y_i y_j G_{ij}]$ will be PD, and the optimization problem above will be convex and have a unique solution.

3 Additional Exercises

Question 1 (SVM w/o Bias Term): Formulate a soft-margin (i.e., with $\xi_i$) SVM without the bias term, i.e., $f(x) = w^T x$. Derive the saddle-point conditions, the KKT conditions, and the dual.

Question 2 (SVM - Primal, Dual QPs): Both the primal and dual forms of the soft-margin SVM are QPs.

1. In the linear case, what problem parameters determine the computational complexity of the primal and dual problems? In what regimes would one be faster than the other?

2. Now consider the case where we use a non-linear feature mapping. What can be said?

References

[1] Lipo Wang. Support Vector Machines: Theory and Applications. Springer, Netherlands, 2005.