LECTURE 7 Support vector machines


SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives: Tikhonov regularization, and the more common geometric perspective. We will focus on the linear SVM.

7.1. SVMs from Tikhonov regularization

We start with Tikhonov regularization
\[
\min_{\theta \in \mathbb{R}^p} \left[ \frac{1}{n} \sum_{i=1}^n V\!\left(y_i, \theta^T x_i\right) + \lambda \|\theta\|^2 \right]
\]
and use the hinge loss
\[
V\!\left(y_i, \theta^T x_i\right) := \left(1 - y_i\, \theta^T x_i\right)_+,
\]
where $(k)_+ := \max(k, 0)$.

Figure 1. The hinge loss plotted as a function of $y \cdot f(x)$.

The resulting optimization problem is
\[
(7.1) \qquad \min_{\theta \in \mathbb{R}^p} \left[ \frac{1}{n} \sum_{i=1}^n \left(1 - y_i\, \theta^T x_i\right)_+ + \lambda \|\theta\|^2 \right],
\]
which is non-differentiable at $(1 - y_i f(x_i)) = 0$.
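For concreteness, here is a minimal sketch (not from the notes; the arrays X, y and the step size lr are assumed) of the objective in (7.1) together with one subgradient step, a subgradient being needed exactly because of the kink at $1 - y_i \theta^T x_i = 0$:

```python
import numpy as np

def hinge_objective(theta, X, y, lam):
    """Objective (7.1): (1/n) sum_i (1 - y_i theta^T x_i)_+ + lam * ||theta||^2."""
    margins = 1.0 - y * (X @ theta)
    return np.mean(np.maximum(margins, 0.0)) + lam * (theta @ theta)

def subgradient_step(theta, X, y, lam, lr=0.1):
    """One step along a valid subgradient of (7.1)."""
    active = (1.0 - y * (X @ theta)) > 0.0        # points whose margin is violated
    grad = -(active * y) @ X / X.shape[0] + 2.0 * lam * theta
    return theta - lr * grad
```

Iterating subgradient_step drives (7.1) down directly; the notes instead move to a constrained reformulation.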

To avoid this non-differentiability we introduce slack variables and write the following constrained optimization problem:
\[
\begin{aligned}
\min_{\theta \in \mathbb{R}^p,\, \xi \in \mathbb{R}^n} \quad & \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|\theta\|^2 \\
\text{subject to:} \quad & y_i\, \theta^T x_i \ge 1 - \xi_i, \qquad i = 1, \ldots, n \\
& \xi_i \ge 0, \qquad i = 1, \ldots, n.
\end{aligned}
\]
The SVM contains an unregularized bias term $b$, so the separating hyperplane need not go through the origin. Plugging this form into the above constrained quadratic problem results in the primal SVM
\[
\begin{aligned}
\min_{\theta \in \mathbb{R}^p,\, \xi \in \mathbb{R}^n,\, b \in \mathbb{R}} \quad & \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|\theta\|^2 \\
\text{subject to:} \quad & y_i \left( \theta^T x_i + b \right) \ge 1 - \xi_i, \qquad i = 1, \ldots, n \\
& \xi_i \ge 0, \qquad i = 1, \ldots, n.
\end{aligned}
\]
Note the following trick: one can reparameterize
\[
\theta = \sum_{j=1}^n c_j x_j.
\]
This is an advantageous representation since one now only needs $n$ variables to parameterize $\theta$ rather than $p$. So we now rewrite the optimization problem as
\[
\begin{aligned}
\min_{c \in \mathbb{R}^n,\, \xi \in \mathbb{R}^n,\, b \in \mathbb{R}} \quad & \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \sum_{i,j} c_i c_j\, x_i^T x_j \\
\text{subject to:} \quad & y_i \left( \sum_{j=1}^n c_j\, x_j^T x_i + b \right) \ge 1 - \xi_i, \qquad i = 1, \ldots, n \\
& \xi_i \ge 0, \qquad i = 1, \ldots, n.
\end{aligned}
\]
We now derive the Wolfe dual quadratic program using Lagrange multiplier techniques:
\[
L(c, \xi, b, \alpha, \zeta) = \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \sum_{i,j} c_i c_j\, x_i^T x_j
- \sum_{i=1}^n \alpha_i \left[ y_i \left( \sum_{j=1}^n c_j\, x_j^T x_i + b \right) - 1 + \xi_i \right]
- \sum_{i=1}^n \zeta_i \xi_i.
\]
We want to minimize $L$ with respect to $c$, $b$, and $\xi$, and maximize $L$ with respect to $\alpha$ and $\zeta$, subject to the constraints of the primal problem and nonnegativity constraints on $\alpha$ and $\zeta$. We first eliminate $b$ and $\xi$ by taking partial derivatives:
\[
\frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^n \alpha_i y_i = 0,
\qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Longrightarrow\; \frac{1}{n} - \alpha_i - \zeta_i = 0 \;\Longrightarrow\; 0 \le \alpha_i \le \frac{1}{n}.
\]

The above two conditions will be constraints that have to be satisfied at optimality. This results in a reduced Lagrangian:
\[
L_R(c, \alpha) = \lambda \sum_{i,j} c_i c_j\, x_i^T x_j - \sum_{i=1}^n \alpha_i \left( y_i \sum_{j=1}^n c_j\, x_j^T x_i - 1 \right).
\]
We now eliminate $c$:
\[
\frac{\partial L_R}{\partial c_i} = 0 \;\Longrightarrow\; c_i = \frac{\alpha_i y_i}{2\lambda}.
\]
Substituting the above expression for $c$ into the reduced Lagrangian we are left with the following dual program:
\[
\begin{aligned}
\max_{\alpha \in \mathbb{R}^n} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda}\, \alpha^T Q\, \alpha \\
\text{subject to:} \quad & \sum_{i=1}^n y_i \alpha_i = 0 \\
& 0 \le \alpha_i \le \frac{1}{n}, \qquad i = 1, \ldots, n,
\end{aligned}
\]
where $Q$ is the matrix defined by
\[
Q_{ij} = y_i y_j\, x_i^T x_j.
\]
In most of the SVM literature, instead of the regularization parameter $\lambda$, regularization is controlled via a parameter $C$, defined using the relationship
\[
C = \frac{1}{2\lambda n}.
\]
Like $\lambda$, the parameter $C$ controls the trade-off between classification accuracy and the norm of the function. The primal and dual problems become, respectively,
\[
\begin{aligned}
\min_{c \in \mathbb{R}^n,\, \xi \in \mathbb{R}^n,\, b \in \mathbb{R}} \quad & C \sum_{i=1}^n \xi_i + \frac{1}{2} \sum_{i,j} c_i c_j\, x_i^T x_j \\
\text{subject to:} \quad & y_i \left( \sum_{j=1}^n c_j\, x_i^T x_j + b \right) \ge 1 - \xi_i, \qquad i = 1, \ldots, n \\
& \xi_i \ge 0, \qquad i = 1, \ldots, n
\end{aligned}
\]
\[
\begin{aligned}
\max_{\alpha \in \mathbb{R}^n} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\, \alpha^T Q\, \alpha \\
\text{subject to:} \quad & \sum_{i=1}^n y_i \alpha_i = 0 \\
& 0 \le \alpha_i \le C, \qquad i = 1, \ldots, n.
\end{aligned}
\]
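The $C$-parameterized dual above is a small, standard quadratic program, so any off-the-shelf QP solver handles it. Below is a minimal sketch (not from the notes) using the cvxopt package, assumed installed; X is an (n, p) array and y a vector of ±1 labels:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(X, y, C):
    """Solve max sum(a) - 0.5 a'Qa  s.t.  y'a = 0, 0 <= a_i <= C, as a cvxopt QP."""
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                   # Q[i, j] = y_i y_j x_i^T x_j
    P = matrix(Q + 1e-8 * np.eye(n))                # tiny ridge for numerical stability
    q = matrix(-np.ones(n))                         # maximizing sum(a) = minimizing -sum(a)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -a <= 0  and  a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))      # equality constraint y^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()               # the optimal alphas
```

Note that the data enter only through the inner products collected in Q, which is what later allows them to be replaced by kernel evaluations.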

7.2. SVMs from a geometric perspective

The traditional approach to developing the mathematics of SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that, intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

We denote our hyperplane by $w$, and we will classify a new point $x$ via the function
\[
(7.2) \qquad f(x) = \mathrm{sign}\,[\langle w, x \rangle].
\]

Figure 2. Two hyperplanes (a) and (b) perfectly separate the data. However, hyperplane (b) has a larger margin and intuitively would be expected to be more accurate on new observations.

Given a separating hyperplane $w$ we let $x$ be a datapoint closest to $w$, and we let $x_w$ be the unique point on $w$ that is closest to $x$. Obviously, finding a maximum margin $w$ is equivalent to maximizing $\|x - x_w\|$. So for some $k$ (assume $k > 0$ for convenience),
\[
\langle w, x \rangle = k, \qquad \langle w, x_w \rangle = 0, \qquad \langle w, (x - x_w) \rangle = k.
\]
Noting that the vector $x - x_w$ is parallel to the normal vector $w$,
\[
\langle w, (x - x_w) \rangle = \left\langle w, \frac{\|x - x_w\|}{\|w\|}\, w \right\rangle = \|w\|^2\, \frac{\|x - x_w\|}{\|w\|} = \|w\|\, \|x - x_w\|,
\]
so
\[
\|w\|\, \|x - x_w\| = k \;\Longrightarrow\; \|x - x_w\| = \frac{k}{\|w\|}.
\]
$k$ is a nuisance parameter, and without any loss of generality we fix $k$ to $1$, and see that maximizing $\|x - x_w\|$ is equivalent to maximizing $\frac{1}{\|w\|}$, which in turn is equivalent to minimizing $\|w\|$ or $\|w\|^2$. We can now define the margin as the distance between the hyperplanes $\langle w, x \rangle = 0$ and $\langle w, x \rangle = 1$.

So if the data are linearly separable and the hyperplanes run through the origin, the maximum margin hyperplane is the solution of
\[
\begin{aligned}
\min_{w \in \mathbb{R}^p} \quad & \frac{1}{2} \|w\|^2 \\
\text{subject to:} \quad & y_i \langle w, x_i \rangle \ge 1, \qquad i = 1, \ldots, n.
\end{aligned}
\]
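A quick numerical check of this scaling argument (a sketch with made-up toy data, not from the notes): once $w$ is rescaled so that $\min_i y_i \langle w, x_i \rangle = 1$, the distance from the hyperplane to the closest point equals $1/\|w\|$.

```python
import numpy as np

# Toy separable data and an arbitrary separating hyperplane through the origin.
w = np.array([2.0, 1.0])
X = np.array([[1.0, -1.0], [0.5, 0.2], [-1.0, 1.0], [-0.3, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Canonical scaling: fix k = 1, i.e. min_i y_i <w, x_i> = 1.
w_c = w / np.min(y * (X @ w))

margin = np.min(np.abs(X @ w_c)) / np.linalg.norm(w_c)   # distance to closest point
print(margin, 1.0 / np.linalg.norm(w_c))                  # the two numbers coincide
```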

The SVM introduced by Vapnik includes an unregularized bias term $b$, leading to classification via a function of the form
\[
f(x) = \mathrm{sign}\,[\langle w, x \rangle + b].
\]
In addition, we need to work with datasets that are not linearly separable, so we introduce slack variables $\xi_i$, just as before. We can still define the margin as the distance between the hyperplanes $\langle w, x \rangle = 0$ and $\langle w, x \rangle = 1$, but the geometric intuition is no longer as clear or compelling. With the bias term and slack variables the primal SVM problem becomes
\[
\begin{aligned}
\min_{w \in \mathbb{R}^p,\, \xi \in \mathbb{R}^n,\, b \in \mathbb{R}} \quad & C \sum_{i=1}^n \xi_i + \frac{1}{2} \|w\|^2 \\
\text{subject to:} \quad & y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \qquad i = 1, \ldots, n \\
& \xi_i \ge 0, \qquad i = 1, \ldots, n.
\end{aligned}
\]
Using Lagrange multipliers we can derive the same dual form as in the previous section.

Historically, most developments began with this geometric form, derived a dual program identical to the dual we derived above, and only then observed that the dual program requires only dot products, and that these dot products can be replaced with a kernel function.

In the linearly separable case, we can also derive the separating hyperplane as a vector parallel to the vector connecting the closest two points in the positive and negative classes, passing through the perpendicular bisector of this vector. This was the Method of Portraits, derived by Vapnik in the 1970s, and recently rediscovered (with non-separable extensions) by Keerthi.

7.3. Optimality conditions

The primal and the dual are both feasible convex quadratic programs. Therefore, they both have optimal solutions, and optimal solutions to the primal and the dual have the same objective value. We derived the dual from the primal using the (now reparameterized) Lagrangian:
\[
L(c, \xi, b, \alpha, \zeta) = C \sum_{i=1}^n \xi_i + \frac{1}{2} \sum_{i,j} c_i c_j\, x_i^T x_j
- \sum_{i=1}^n \alpha_i \left[ y_i \left( \sum_{j=1}^n c_j\, x_j^T x_i + b \right) - 1 + \xi_i \right]
- \sum_{i=1}^n \zeta_i \xi_i.
\]
We now consider the dual variables associated with the primal constraints:
\[
\alpha_i \;\Longrightarrow\; y_i \left( \sum_{j=1}^n c_j\, x_j^T x_i + b \right) - 1 + \xi_i \ge 0,
\qquad
\zeta_i \;\Longrightarrow\; \xi_i \ge 0.
\]
Complementary slackness tells us that at optimality, either the primal inequality is satisfied with equality or the corresponding dual variable is zero.

In other words, if $c$, $\xi$, $b$, $\alpha$, and $\zeta$ are optimal solutions to the primal and dual, then
\[
\alpha_i \left[ y_i \left( \sum_{j=1}^n c_j\, x_j^T x_i + b \right) - 1 + \xi_i \right] = 0,
\qquad
\zeta_i \xi_i = 0.
\]
All optimal solutions must satisfy:
\[
\begin{aligned}
& \sum_{j=1}^n c_j\, x_i^T x_j - \sum_{j=1}^n y_j \alpha_j\, x_i^T x_j = 0 \qquad i = 1, \ldots, n \\
& \sum_{i=1}^n \alpha_i y_i = 0 \\
& C - \alpha_i - \zeta_i = 0 \qquad i = 1, \ldots, n \\
& y_i \left( \sum_{j=1}^n y_j \alpha_j\, x_i^T x_j + b \right) - 1 + \xi_i \ge 0 \qquad i = 1, \ldots, n \\
& \alpha_i \left[ y_i \left( \sum_{j=1}^n y_j \alpha_j\, x_i^T x_j + b \right) - 1 + \xi_i \right] = 0 \qquad i = 1, \ldots, n \\
& \zeta_i \xi_i = 0 \qquad i = 1, \ldots, n \\
& \xi_i,\ \alpha_i,\ \zeta_i \ge 0 \qquad i = 1, \ldots, n.
\end{aligned}
\]
The above optimality conditions are both necessary and sufficient. If we have $c$, $\xi$, $b$, $\alpha$, and $\zeta$ satisfying the above conditions, we know that they represent optimal solutions to the primal and dual problems. These optimality conditions are also known as the Karush-Kuhn-Tucker (KKT) conditions.

Suppose we have the optimal $\alpha_i$'s. Also suppose (this always happens in practice) that there exists an $i$ satisfying $0 < \alpha_i < C$. Then
\[
\begin{aligned}
\alpha_i < C &\;\Longrightarrow\; \zeta_i > 0 \\
&\;\Longrightarrow\; \xi_i = 0 \\
&\;\Longrightarrow\; y_i \left( \sum_{j=1}^n y_j \alpha_j\, x_i^T x_j + b \right) - 1 = 0 \\
&\;\Longrightarrow\; b = y_i - \sum_{j=1}^n y_j \alpha_j\, x_i^T x_j.
\end{aligned}
\]
So if we know the optimal $\alpha$'s, we can determine $b$. We define our classification function $f(x)$ as
\[
f(x) = \sum_{i=1}^n y_i \alpha_i\, x^T x_i + b.
\]
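A minimal sketch (assumed inputs: the optimal alpha from the dual, the data X, labels y, and C; not part of the notes) of recovering $b$ from any free support vector with $0 < \alpha_i < C$ and of evaluating $f$:

```python
import numpy as np

def recover_b(alpha, X, y, C, tol=1e-6):
    """b = y_i - sum_j y_j alpha_j x_i^T x_j for any i with 0 < alpha_i < C."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    i = free[0]                                     # any free support vector works
    return y[i] - np.sum(y * alpha * (X @ X[i]))

def decision_function(x, alpha, X, y, b):
    """f(x) = sum_i y_i alpha_i x^T x_i + b; the sign gives the predicted class."""
    return np.sum(y * alpha * (X @ x)) + b
```

Only the points with $\alpha_i > 0$ (the support vectors) contribute to the sum, so in practice $f$ is evaluated over the support vectors alone.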

With this definition we can derive reduced optimality conditions. For example, consider an $i$ such that $y_i f(x_i) < 1$:
\[
y_i f(x_i) < 1 \;\Longrightarrow\; \xi_i > 0 \;\Longrightarrow\; \zeta_i = 0 \;\Longrightarrow\; \alpha_i = C.
\]
Conversely, suppose $\alpha_i = C$:
\[
\alpha_i = C \;\Longrightarrow\; y_i f(x_i) - 1 + \xi_i = 0 \;\Longrightarrow\; y_i f(x_i) \le 1.
\]

Figure 3. A geometric interpretation of the reduced optimality conditions. The open squares and circles correspond to cases where $\alpha_i = 0$. The dark circles and squares correspond to cases where $y_i f(x_i) = 1$ and $\alpha_i \le C$; these are samples at the margin. The grey circles and squares correspond to cases where $y_i f(x_i) < 1$ and $\alpha_i = C$.

7.4. Solving the SVM optimization problem

Our plan will be to solve the dual problem to find the $\alpha$'s, and use them to find $b$ and our function $f$. The dual problem is easier to solve than the primal problem: it has simple box constraints and a single linear equality constraint and, even better, we will see that the problem can be decomposed into a sequence of smaller problems.

We can solve QPs using standard software, and many codes are available. The main problem is that the matrix $Q$ is dense and is $n \times n$, so for large $n$ we cannot even write it down. Standard QP software requires the $Q$ matrix, so it is not suitable for large problems. To get around this memory issue we partition the dataset into a working set $W$ and the remaining points $R$. We can rewrite the dual problem as:
\[
\begin{aligned}
\max_{\alpha_W \in \mathbb{R}^{|W|},\, \alpha_R \in \mathbb{R}^{|R|}} \quad &
\sum_{i \in W} \alpha_i + \sum_{i \in R} \alpha_i
- \frac{1}{2}
\begin{bmatrix} \alpha_W \\ \alpha_R \end{bmatrix}^T
\begin{bmatrix} Q_{WW} & Q_{WR} \\ Q_{RW} & Q_{RR} \end{bmatrix}
\begin{bmatrix} \alpha_W \\ \alpha_R \end{bmatrix} \\
\text{subject to:} \quad & \sum_{i \in W} y_i \alpha_i + \sum_{i \in R} y_i \alpha_i = 0 \\
& 0 \le \alpha_i \le C, \quad \forall i.
\end{aligned}
\]
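A minimal sketch (assumed variables, not from the notes) of how the blocks of $Q$ used above would be extracted with plain numpy indexing, given integer index arrays W and R that partition the training indices:

```python
import numpy as np

def dual_blocks(Q, W, R):
    """Blocks of the partitioned dual; W and R are integer index arrays."""
    Q_WW = Q[np.ix_(W, W)]      # |W| x |W|
    Q_WR = Q[np.ix_(W, R)]      # |W| x |R|
    Q_RW = Q_WR.T               # Q is symmetric
    Q_RR = Q[np.ix_(R, R)]      # |R| x |R|; constant once alpha_R is held fixed
    return Q_WW, Q_WR, Q_RW, Q_RR
```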

Suppose we have a feasible solution $\alpha$. We can get a better solution by treating $\alpha_W$ as variable and $\alpha_R$ as constant. We can solve the reduced dual problem:
\[
\begin{aligned}
\max_{\alpha_W \in \mathbb{R}^{|W|}} \quad & (1 - Q_{WR}\, \alpha_R)^T \alpha_W - \frac{1}{2}\, \alpha_W^T Q_{WW}\, \alpha_W \\
\text{subject to:} \quad & \sum_{i \in W} y_i \alpha_i = -\sum_{i \in R} y_i \alpha_i \\
& 0 \le \alpha_i \le C, \quad \forall i \in W.
\end{aligned}
\]
The reduced problems are of fixed size and can be solved using a standard QP code. Convergence proofs are difficult, but this approach seems to always converge to an optimal solution in practice.

An important issue in the decomposition is selecting the working set. There are many different approaches. The basic idea is to examine points not in the working set, find points which violate the reduced optimality conditions, and add them to the working set, while removing points which are in the working set but are far from violating the optimality conditions.
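As an illustration of that selection heuristic, here is a minimal sketch (assumed function and variable names, not the notes' algorithm) that ranks points by how strongly they violate the reduced optimality conditions and keeps the worst offenders as the next working set:

```python
import numpy as np

def select_working_set(alpha, y, f_vals, C, size, tol=1e-3):
    """f_vals[i] = f(x_i) for the current alpha; returns indices of the next working set."""
    yf = y * f_vals
    # Reduced optimality conditions:
    #   alpha_i < C should imply y_i f(x_i) >= 1; alpha_i > 0 should imply y_i f(x_i) <= 1.
    up_violation = (alpha < C - tol) & (yf < 1.0 - tol)
    down_violation = (alpha > tol) & (yf > 1.0 + tol)
    severity = np.abs(1.0 - yf) * (up_violation | down_violation)
    return np.argsort(-severity)[:size]
```

Points with zero severity already satisfy the conditions and can be dropped from the working set, mirroring the removal step described above.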