Support Vector Machines

Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal soft) margin hyperplanes, and kernels. Conceptually, given training data $(x_i, y_i)$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, we are going to

1. Put the feature vectors through a nonlinear map: $x_i \rightarrow \Phi(x_i)$. This takes them from $\mathbb{R}^d$ to some abstract Hilbert space $\mathcal{H}$.

2. Fit a linear classifier in $\mathcal{H}$. Since $\mathcal{H}$ has a valid inner product, there is a notion of inner product (hence projections onto subspaces) and distance in $\mathcal{H}$, making this problem well-defined.

3. Translate the rule back to a function (classifier) on $\mathbb{R}^d$ using
   $$h(x) = \operatorname{sign}\big(\langle w, \Phi(x) \rangle + b\big).$$

The trick is that step 1 is actually implicit. By using a kernel, we avoid having to actually compute the map. This is also true for the inner product $\langle w, \Phi(x) \rangle$; we will see that this can also be computed using a kernel. To see how this works, we turn to our recently acquired knowledge of constrained optimization and duality.
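Before turning to duality, here is a tiny numerical illustration of step 1 being implicit (an added sketch, not part of the original notes; the degree-2 polynomial kernel and the explicit feature map below are standard examples chosen only for concreteness):

```python
import numpy as np

def phi(x):
    """One explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def k(x, z):
    """Degree-2 polynomial kernel; equals <phi(x), phi(z)> without forming phi."""
    return (x @ z + 1.0) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(phi(x) @ phi(z))   # inner product computed in the feature space
print(k(x, z))           # same value, from a single kernel evaluation
```

The same principle is what lets us train and evaluate the classifier below without ever touching $\Phi$.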

Dual of optimal soft margin

The optimal soft-margin classifier is found by solving
$$
(\mathrm{SM}) \qquad
\begin{array}{ll}
\underset{b,\,w,\,\xi}{\text{minimize}} & \dfrac{1}{2}\|w\|_2^2 + \dfrac{C}{n}\displaystyle\sum_{i=1}^{n} \xi_i \\
\text{subject to} & y_i(w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \\
& \xi_i \geq 0, \quad i = 1, \ldots, n.
\end{array}
$$
If we try to solve this program after we map $x_i \rightarrow \Phi(x_i)$, then the optimization program is looking for a $w$ in the same space as $\Phi(x_i)$. Since this space will in general be much higher dimensional than $\mathbb{R}^d$ (and indeed could very well be infinite dimensional), this is problematic. Fortunately, duality and the KKT conditions will come to our rescue. Here are two facts, which we verify in the later sections of these notes (a short numerical sketch of both follows the second fact).

1. The dual of program (SM) is
   $$
   (\mathrm{DSM}) \qquad
   \begin{array}{ll}
   \underset{\alpha \in \mathbb{R}^n}{\text{maximize}} & -\dfrac{1}{2}\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i \\
   \text{subject to} & \displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0, \\
   & \alpha_i \geq 0, \quad \alpha_i \leq C/n, \quad i = 1, \ldots, n.
   \end{array}
   $$
   Note that this is an optimization program in $n$ variables. In fact, just like (SM), it is a (concave) quadratic program with linear equality and inequality constraints.

2. Given the solution $\alpha^\star$ to the dual, we can easily compute the primal solution $w^\star, b^\star$. The weights $w^\star$ are computed as
   $$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i.$$
   Recall that the weights are used to define a linear functional which the classifier evaluates, and note that the expression above implies
   $$w^{\star T} x = \sum_{i=1}^{n} \alpha_i^\star y_i x_i^T x.$$
   The offset $b^\star$ is simply
   $$b^\star = y_{i_0} - w^{\star T} x_{i_0} = y_{i_0} - \sum_{i=1}^{n} \alpha_i^\star y_i x_i^T x_{i_0}, \qquad (1)$$
   where $i_0$ is any index where the inequality constraint for $\alpha_{i_0}^\star$ is slack: $0 < \alpha_{i_0}^\star < C/n$. (In practice, you might average the associated values over all indices that have this property.)
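The following numpy sketch (added here; `dual_qp_matrix`, `primal_from_dual`, and the tolerance `tol` are illustrative names and choices, and `alpha` is assumed to come from whatever QP solver you apply to (DSM)) spells out both facts: the data enters the dual only through inner products, and $(w^\star, b^\star)$ are read off the dual solution.

```python
import numpy as np

def dual_qp_matrix(X, y):
    """Fact 1 in code: (DSM) needs only Q[i, j] = y_i y_j x_i^T x_j.
    The dual maximizes -1/2 a^T Q a + sum(a) subject to y^T a = 0 and
    0 <= a_i <= C/n, which any off-the-shelf QP solver can handle."""
    return (y[:, None] * y[None, :]) * (X @ X.T)

def primal_from_dual(alpha, X, y, C, tol=1e-8):
    """Fact 2 in code: recover (w*, b*) from a solution alpha of (DSM)."""
    n = len(y)
    w = X.T @ (alpha * y)                      # w* = sum_i alpha_i y_i x_i
    # Indices with a slack box constraint, 0 < alpha_i < C/n, give b* via (1);
    # averaging over all of them is the more stable option mentioned above.
    slack = (alpha > tol) & (alpha < C / n - tol)
    b = np.mean(y[slack] - X[slack] @ w)
    return w, b
```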

The kernelized dual

The dual program and the computation of the affine functional $w^{\star T} x + b^\star$ rely on computations of inner products between feature vectors. We can kernelize the procedure above by replacing the inner products with a kernel evaluation,
$$x_i^T x_j \rightarrow k(x_i, x_j),$$
where $k(\cdot, \cdot)$ is a symmetric positive semidefinite kernel. As we have seen, $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ for some associated nonlinear mapping $\Phi(\cdot)$ and inner product $\langle \cdot, \cdot \rangle$ in an abstract Hilbert space. We compute $\alpha^\star$ by solving
$$
(\mathrm{KDSM}) \qquad
\begin{array}{ll}
\underset{\alpha \in \mathbb{R}^n}{\text{maximize}} & -\dfrac{1}{2}\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \\
\text{subject to} & \displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0, \\
& 0 \leq \alpha_i \leq C/n, \quad i = 1, \ldots, n.
\end{array}
$$
Notice that no matter what $k(\cdot, \cdot)$ is, this is an optimization program in $\mathbb{R}^n$. Given a solution $\alpha^\star$ to the above, we define our classifier as
$$h^\star(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i^\star y_i k(x_i, x) + b^\star\right), \qquad (2)$$
where $b^\star$ is defined as in (1). Notice that back in the feature space $\mathbb{R}^d$, this classifier is in general nonlinear, as the set
$$\left\{ x \in \mathbb{R}^d : \sum_{i=1}^{n} \alpha_i^\star y_i k(x_i, x) = -b^\star \right\}$$
is no longer a hyperplane. It is some smooth surface that depends on the properties of $k(\cdot, \cdot)$ (and of course the data).

The term support vector machine means solving the optimization program (KDSM) to implement the classifier (2). As we dig a little deeper into how the dual is derived, we will see where this term "support vector" comes from.
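To make (KDSM) and the classifier (2) concrete, here is a hedged sketch (not from the notes). It assumes the `cvxopt` QP solver is available, uses a Gaussian (RBF) kernel purely as an example, and the function names, `gamma`, and the `1e-6` thresholds are illustrative choices rather than anything prescribed above.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumed available; any QP solver would do

def rbf_kernel(X, Z, gamma=1.0):
    """Example kernel: k(x, z) = exp(-gamma ||x - z||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_kernel_svm(X, y, C=1.0, kernel=rbf_kernel):
    """Solve (KDSM) as a QP and return (alpha, b*)."""
    n = len(y)
    K = kernel(X, X)
    Q = (y[:, None] * y[None, :]) * K                 # Q_ij = y_i y_j k(x_i, x_j)
    # cvxopt minimizes 1/2 a^T Q a - 1^T a  s.t.  G a <= h  and  y^T a = 0.
    P, q = matrix(Q), matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), (C / n) * np.ones(n)]))
    A, b_eq = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()
    # Offset from (1), averaged over indices with 0 < alpha_i < C/n.
    m = (alpha > 1e-6) & (alpha < C / n - 1e-6)
    b = np.mean(y[m] - (alpha * y) @ K[:, m])
    return alpha, b

def predict(alpha, b, X_train, y_train, X_new, kernel=rbf_kernel):
    """Classifier (2): sign(sum_i alpha_i y_i k(x_i, x) + b*)."""
    return np.sign((alpha * y_train) @ kernel(X_train, X_new) + b)
```

Note that the training data never appears except through kernel evaluations, exactly as promised.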

General approach to kernelization

Optimal soft margin is not the only linear machine learning algorithm that can be kernelized. Kernelization can be applied to other classification and regression algorithms as well (we will see more on the regression side of things a little later in the course). There are three main steps to kernelizing an algorithm:

1. Show that the training process depends only on inner products between input feature vectors, $x_i^T x_j$.

2. Show that applying the decision rule to a general $x$ only involves computing inner products (i.e., $w^T x$). (This means that your classification rule is linear.)

3. Replace all inner products with evaluations of the kernel function $k(\cdot, \cdot)$.

In short, this idea of kernelization has applications well beyond SVMs.
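As a literal reading of step 3, the sketch below (added; the kernel choices are just common examples) shows the substitution in code: wherever a linear algorithm consumed the matrix of inner products `X @ X.T`, hand it a Gram matrix of kernel evaluations instead.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                           # plain inner product x^T z

def poly_kernel(x, z, degree=3):
    return (x @ z + 1.0) ** degree         # an example nonlinear kernel

def gram_matrix(X, kernel=linear_kernel):
    """K[i, j] = k(x_i, x_j): the only object a kernelized algorithm needs."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.default_rng(1).standard_normal((5, 3))
# With the linear kernel we recover exactly the inner-product matrix ...
assert np.allclose(gram_matrix(X, linear_kernel), X @ X.T)
# ... and swapping in poly_kernel changes nothing else about the algorithm.
K = gram_matrix(X, poly_kernel)
```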

Technical details: Primals and duals in convex programming

For a general convex program (meaning that $f(\cdot)$ and the $g_m(\cdot)$ are convex functions) with inequality constraints,
$$
(\mathrm{P}) \qquad
\begin{array}{ll}
\underset{x}{\text{minimize}} & f(x) \\
\text{subject to} & g_m(x) \leq 0, \quad m = 1, \ldots, M,
\end{array}
$$
the Lagrangian is
$$L(x, \lambda) = f(x) + \sum_{m=1}^{M} \lambda_m g_m(x).$$
Recall that if $x_0$ is feasible, meaning
$$g_m(x_0) \leq 0 \quad \text{for all } m = 1, \ldots, M,$$
then clearly
$$L(x_0, \lambda) \leq f(x_0) \quad \text{for all } \lambda \geq 0.$$
And in fact,
$$\max_{\lambda \geq 0} L(x_0, \lambda) = f(x_0).$$
Note that if $x$ is not feasible ($g_m(x) > 0$ for some $m$), then $\max_{\lambda \geq 0} L(x, \lambda) = +\infty$. Thus solving (P) is the same as solving
$$\underset{x}{\text{minimize}} \ \underset{\lambda \geq 0}{\text{maximize}} \ L(x, \lambda).$$

The dual optimization program switches the order of the min and max, solving
$$
(\mathrm{D}) \qquad
\begin{array}{ll}
\underset{\lambda}{\text{maximize}} & d(\lambda) \\
\text{subject to} & \lambda \geq 0,
\end{array}
$$
where
$$d(\lambda) = \min_{x} L(x, \lambda).$$
If $p^\star$ is the optimal value of the primal program (P) and $d^\star$ is the optimal value of the dual (D), then it is always true that $d^\star \leq p^\star$. Under certain conditions on the constraint functions $g_m(x)$, we have strong duality, meaning $d^\star = p^\star$. One example of these conditions is that the $g_m(x)$ are affine (meaning they can be written as $g_m(x) = a_m^T x + b_m$ for some vector $a_m$ and scalar $b_m$), which is the case for optimal soft margin.

In addition, when the $g_m(x)$ are affine, the KKT conditions are necessary and sufficient. That is, any solution $x^\star$ to the primal (P) and any solution $\lambda^\star$ to the dual (D) will obey
$$
\begin{aligned}
\nabla f(x^\star) + \sum_{m=1}^{M} \lambda_m^\star \nabla g_m(x^\star) &= 0, \\
g_m(x^\star) &\leq 0, \quad m = 1, \ldots, M, \\
\lambda_m^\star &\geq 0, \quad m = 1, \ldots, M, \\
\lambda_m^\star g_m(x^\star) &= 0, \quad m = 1, \ldots, M.
\end{aligned}
$$
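As a concrete (toy) illustration added here, take $f(x) = x^2$ with the single affine constraint $g(x) = 1 - x \leq 0$. Then
$$
\begin{aligned}
L(x, \lambda) &= x^2 + \lambda(1 - x), \\
d(\lambda) &= \min_x L(x, \lambda) = \lambda - \frac{\lambda^2}{4} \quad \text{(minimum at } x = \lambda/2\text{)}, \\
d^\star &= \max_{\lambda \geq 0} d(\lambda) = 1 \quad \text{at } \lambda^\star = 2, \\
p^\star &= \min_{x \geq 1} x^2 = 1 \quad \text{at } x^\star = 1,
\end{aligned}
$$
so $d^\star = p^\star$, as strong duality promises for an affine constraint. The KKT conditions also check out: $\nabla f(x^\star) + \lambda^\star \nabla g(x^\star) = 2 - 2 = 0$, $g(x^\star) = 0$, $\lambda^\star \geq 0$, and $\lambda^\star g(x^\star) = 0$.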

Example: The dual of a QP

We know that optimal soft margin is a quadratic program with linear inequality constraints (QP). So for a quick example of how to derive a dual function, let's look at the general QP:
$$
(\mathrm{QP}) \qquad
\begin{array}{ll}
\underset{z \in \mathbb{R}^N}{\text{minimize}} & \dfrac{1}{2} z^T P z + c^T z \\
\text{subject to} & Az \leq q.
\end{array}
$$
Above, $A$ is an $M \times N$ matrix, and $q \in \mathbb{R}^M$. The expression $Az \leq q$ is a compact way to write the $M$ affine inequality constraints
$$g_m(z) = a_m^T z - q_m \leq 0,$$
where $a_m^T$ is the $m$th row of $A$, and $q_m$ is the $m$th entry in $q$. For (QP) to be convex, the $N \times N$ matrix $P$ needs to be symmetric positive semidefinite (i.e., all of its eigenvalues are $\geq 0$). In this example, we will also assume $P$ is invertible; the discussion below can be extended to the general case, but this assumption just makes the derivation cleaner.

The Lagrangian is
$$L(z, \lambda) = \frac{1}{2} z^T P z + c^T z + \sum_{m=1}^{M} \lambda_m (a_m^T z - q_m).$$
Notice that we can write the last term above more compactly as
$$\sum_{m=1}^{M} \lambda_m (a_m^T z - q_m) = \lambda^T (Az - q).$$
As a function of $z$, the Lagrangian is a convex quadratic, so finding its unconstrained minimum amounts to finding a point where the gradient is equal to zero.

Since
$$\nabla_z L(z, \lambda) = P z + c + A^T \lambda,$$
we know that $\min_z L(z, \lambda)$ will be achieved when
$$P z + c + A^T \lambda = 0 \quad \Longleftrightarrow \quad z = -P^{-1}(c + A^T \lambda).$$
Thus the dual function is
$$
\begin{aligned}
d(\lambda) &= \frac{1}{2}(c + A^T\lambda)^T P^{-1} (c + A^T\lambda) - c^T P^{-1}(c + A^T\lambda) + \lambda^T\left(-A P^{-1}(c + A^T\lambda) - q\right) \\
&= -\frac{1}{2}(c + A^T\lambda)^T P^{-1} (c + A^T\lambda) - q^T \lambda,
\end{aligned}
$$
and the dual optimization program is
$$
(\mathrm{DQP}) \qquad
\begin{array}{ll}
\underset{\lambda \in \mathbb{R}^M}{\text{maximize}} & -\dfrac{1}{2}(c + A^T\lambda)^T P^{-1} (c + A^T\lambda) - q^T \lambda \\
\text{subject to} & \lambda \geq 0.
\end{array}
$$
This is also a (concave) quadratic program with linear inequality constraints.
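A quick numerical sanity check of this derivation (added; the random problem sizes and the construction of a feasible point are arbitrary choices) verifies weak duality: every value of the dual function is a lower bound on the primal objective at any feasible point.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3

# Random convex QP data: P symmetric positive definite (hence invertible).
B = rng.standard_normal((N, N))
P = B @ B.T + np.eye(N)
c = rng.standard_normal(N)
A = rng.standard_normal((M, N))

# Choose q so that a chosen z0 is strictly feasible (A z0 <= q).
z0 = rng.standard_normal(N)
q = A @ z0 + rng.uniform(0.1, 1.0, size=M)

def f(z):
    return 0.5 * z @ P @ z + c @ z

def d(lam):
    """Dual function derived above: -1/2 u^T P^{-1} u - q^T lam with u = c + A^T lam."""
    u = c + A.T @ lam
    return -0.5 * u @ np.linalg.solve(P, u) - q @ lam

for _ in range(1000):
    lam = rng.uniform(0.0, 5.0, size=M)     # any lam >= 0
    assert d(lam) <= f(z0) + 1e-9           # weak duality: d(lam) <= f(z0)
```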

The dual of optimal soft margin

Now let's derive the dual of the optimal soft margin program
$$
(\mathrm{SM}) \qquad
\begin{array}{ll}
\underset{b,\,w,\,\xi}{\text{minimize}} & \dfrac{1}{2}\|w\|_2^2 + \dfrac{C}{n}\displaystyle\sum_{i=1}^{n} \xi_i \\
\text{subject to} & y_i(w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \\
& \xi_i \geq 0, \quad i = 1, \ldots, n.
\end{array}
$$
We could put this in the form of a QP and use our result above, but it is just as easy (and more enlightening) to compute the dual directly. We will use two sets of Lagrange multipliers, $\alpha, \beta \in \mathbb{R}^n$, for the two sets of linear inequality constraints above. The Lagrangian is
$$L = \frac{1}{2}\|w\|_2^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left(1 - \xi_i - y_i w^T x_i - y_i b\right) - \sum_{i=1}^{n} \beta_i \xi_i.$$
This function is smooth in all of its variables $(b, w, \xi, \alpha, \beta)$; it will be minimized in the primal variables $b, w, \xi$ when
$$
\begin{aligned}
\nabla_b L &= -\sum_{i=1}^{n} \alpha_i y_i = 0, \\
\nabla_w L &= w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \\
\nabla_\xi L &= \frac{C}{n}\mathbf{1} - \alpha - \beta = 0,
\end{aligned}
$$
where $\mathbf{1}$ is a vector with every entry equal to 1. The last relation above means we can eliminate the second set of Lagrange multipliers by substituting $\beta_i = C/n - \alpha_i$.

Plugging these relations into the Lagrangian gives us the dual function
$$
\begin{aligned}
d(\alpha) &= \frac{1}{2}\Big\|\sum_{i=1}^{n} \alpha_i y_i x_i\Big\|_2^2 + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \Big(\sum_{j=1}^{n} \alpha_j y_j x_j\Big)^T x_i \\
&= -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i,
\end{aligned}
$$
where this simplification implicitly incorporates the constraint that $\sum_{i=1}^{n} \alpha_i y_i = 0$. Finally, this gives us the dual program
$$
(\mathrm{DSM}) \qquad
\begin{array}{ll}
\underset{\alpha \in \mathbb{R}^n}{\text{maximize}} & -\dfrac{1}{2}\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i \\
\text{subject to} & \displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0, \\
& \alpha_i \geq 0, \quad \alpha_i \leq C/n, \quad i = 1, \ldots, n.
\end{array}
$$
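Since the constraints of (SM) are affine, strong duality holds, and a numerical solution can be checked by comparing the two objective values. The sketch below (added; `alpha` is assumed to come from a QP solver such as the one sketched earlier, and the tolerance is an arbitrary choice) computes the duality gap, which should be zero up to solver accuracy.

```python
import numpy as np

def duality_gap(alpha, X, y, C, tol=1e-6):
    """Primal objective of (SM) minus dual objective of (DSM) at a candidate solution."""
    n = len(y)
    w = X.T @ (alpha * y)                            # w* from the stationarity condition
    m = (alpha > tol) & (alpha < C / n - tol)        # indices with slack box constraints
    b = np.mean(y[m] - X[m] @ w)                     # b* as in (1)
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))      # smallest feasible slacks
    primal = 0.5 * w @ w + (C / n) * xi.sum()
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    dual = -0.5 * alpha @ Q @ alpha + alpha.sum()
    return primal - dual                             # ~0 at the optimum
```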

Support vectors

From the complementary slackness conditions ($\lambda_m^\star g_m(x^\star) = 0$), we know that
$$\alpha_i^\star \left( 1 - \xi_i^\star - y_i (w^{\star T} x_i + b^\star) \right) = 0, \quad i = 1, \ldots, n.$$
Thus (using that $y_i \in \{-1, +1\}$, so $y_i^{-1} = y_i$),
$$\alpha_i^\star > 0 \quad \Longrightarrow \quad w^{\star T} x_i + b^\star = y_i (1 - \xi_i^\star).$$
This means the $i$ for which $\alpha_i^\star > 0$ correspond to data points $x_i$ that are on or inside the margin of separation. These are called the support vectors, and in practice, there tend to be a small number of them. One consequence of this is that the weights can be calculated from just this small portion of the data. Recall that we derived the following expression:
$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i = \sum_{i \in \mathrm{SV}} \alpha_i^\star y_i x_i, \quad \text{where } \mathrm{SV} = \text{set of support vectors}.$$

If the set of support vectors is small, then computing $w^\star$ can be very inexpensive.

The other thing that the complementary slackness conditions tell us is that
$$\left( \frac{C}{n} - \alpha_i^\star \right) \xi_i^\star = 0 \quad \text{for all } i = 1, \ldots, n.$$
This means that
$$\alpha_i^\star < C/n \quad \Longrightarrow \quad \xi_i^\star = 0.$$
Combining this with the condition above means that
$$0 < \alpha_i^\star < C/n \quad \Longrightarrow \quad b^\star = y_i - w^{\star T} x_i.$$
This gives us the relation that allows us to recover the affine offset from the dual solution.
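Putting the complementary slackness facts to work, the sketch below (added; names and tolerances are illustrative, and `alpha` is assumed to solve (DSM) numerically) sorts the training points into margin and non-margin support vectors and recovers $b^\star$ by averaging, as suggested earlier.

```python
import numpy as np

def support_vector_summary(alpha, X, y, C, tol=1e-8):
    """Categorize training points by their dual variables."""
    sv = alpha > tol                              # support vectors: alpha_i > 0
    on_margin = sv & (alpha < C / len(y) - tol)   # 0 < alpha_i < C/n  =>  xi_i = 0
    at_bound = alpha >= C / len(y) - tol          # alpha_i = C/n: on or past the margin
    w = X[sv].T @ (alpha[sv] * y[sv])             # w* uses only the support vectors
    b = np.mean(y[on_margin] - X[on_margin] @ w)  # b* from any margin support vector
    return {"n_support": int(sv.sum()),
            "n_on_margin": int(on_margin.sum()),
            "n_at_bound": int(at_bound.sum()),
            "w": w, "b": b}
```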