Support Vector Machines

Support Vector Machines Ryan M. Rifkin Google, Inc. 2008

Plan Regularization derivation of SVMs Geometric derivation of SVMs Optimality, Duality and Large Scale SVMs

The Regularization Setting (Again) We are given l examples (x_1, y_1), ..., (x_l, y_l), with x_i ∈ R^n and y_i ∈ {−1, 1} for all i. As mentioned last class, we can find a classification function by solving a regularized learning problem:

\min_{f \in H} \; \frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)) + \lambda \|f\|_K^2.

Note that in this class we specifically consider binary classification.

The Hinge Loss The classical SVM arises by considering the specific loss function

V(f(x), y) \equiv (1 - y f(x))_+,

where (k)_+ \equiv \max(k, 0).
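
As a quick concrete illustration (a minimal NumPy sketch, not part of the original slides), the hinge loss can be evaluated elementwise:

```python
import numpy as np

def hinge_loss(y, fx):
    """Elementwise hinge loss (1 - y*f(x))_+ for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# Points classified correctly with margin >= 1 incur zero loss;
# the loss grows linearly as y*f(x) falls below 1.
y = np.array([1.0, 1.0, -1.0, -1.0])
fx = np.array([2.0, 0.3, -0.5, 1.2])
print(hinge_loss(y, fx))   # [0.  0.7 0.5 2.2]
```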

The Hinge Loss [Figure: plot of the hinge loss (1 − y f(x))_+ as a function of y·f(x); the loss is zero for y·f(x) ≥ 1 and increases linearly as y·f(x) decreases below 1.]

Substituting In The Hinge Loss With the hinge loss, our regularization problem becomes

\min_{f \in H} \; \frac{1}{l} \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda \|f\|_K^2.

Note that we don't have a 1/2 multiplier on the regularization term.

Slack Variables This problem is non-differentiable (because of the kink in V), so we introduce slack variables ξ_i to make the problem easier to work with:

\min_{f \in H} \; \frac{1}{l} \sum_{i=1}^{l} \xi_i + \lambda \|f\|_K^2
subject to:  y_i f(x_i) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l

Applying The Representer Theorem Substituting in

f^*(x) = \sum_{i=1}^{l} c_i K(x, x_i),

we arrive at a constrained quadratic programming problem:

\min_{c \in R^l, \, \xi \in R^l} \; \frac{1}{l} \sum_{i=1}^{l} \xi_i + \lambda c^T K c
subject to:  y_i \sum_{j=1}^{l} c_j K(x_i, x_j) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l

Adding A Bias Term If we add an unregularized bias term b (which presents some theoretical difficulties, to be discussed later), we arrive at the primal SVM:

\min_{c \in R^l, \, b \in R, \, \xi \in R^l} \; \frac{1}{l} \sum_{i=1}^{l} \xi_i + \lambda c^T K c
subject to:  y_i \left( \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l
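
To make the primal concrete, here is a small sketch that solves it numerically. It assumes the CVXPY modeling library and a Gaussian kernel, both of which are my choices for illustration rather than anything specified in the slides:

```python
import numpy as np
import cvxpy as cp

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def primal_svm(K, y, lam):
    """Primal SVM in the expansion coefficients c, bias b, and slacks xi."""
    l = len(y)
    c, b, xi = cp.Variable(l), cp.Variable(), cp.Variable(l)
    K_psd = K + 1e-8 * np.eye(l)   # small ridge so the PSD check passes numerically
    objective = cp.Minimize(cp.sum(xi) / l + lam * cp.quad_form(c, K_psd))
    constraints = [cp.multiply(y, K @ c + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return c.value, b.value, xi.value

# Toy usage on roughly linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
c, b, xi = primal_svm(gaussian_kernel_matrix(X), y, lam=0.1)
```

A direct formulation like this is only practical for small l; the dual and decomposition methods discussed later are what make large problems tractable.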

Forming the Lagrangian For reasons that will be clear in a few slides, we derive the Wolfe dual quadratic program using Lagrange multiplier techniques:

L(c, \xi, b, \alpha, \zeta) = \frac{1}{l} \sum_{i=1}^{l} \xi_i + \lambda c^T K c
  - \sum_{i=1}^{l} \alpha_i \left( y_i \left\{ \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \right)
  - \sum_{i=1}^{l} \zeta_i \xi_i

We want to minimize L with respect to c, b, and ξ, and maximize L with respect to α and ζ, subject to the constraints of the primal problem and nonnegativity constraints on α and ζ.

Eliminating b and ξ

\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{l} \alpha_i y_i = 0

\frac{\partial L}{\partial \xi_i} = 0 \implies \frac{1}{l} - \alpha_i - \zeta_i = 0 \implies 0 \le \alpha_i \le \frac{1}{l}

We write a reduced Lagrangian in terms of the remaining variables:

L_R(c, \alpha) = \lambda c^T K c - \sum_{i=1}^{l} \alpha_i \left( y_i \sum_{j=1}^{l} c_j K(x_i, x_j) - 1 \right)

Eliminating c Assuming the K matrix is invertible,

\frac{\partial L_R}{\partial c} = 0 \implies 2 \lambda K c - K Y \alpha = 0 \implies c_i = \frac{\alpha_i y_i}{2\lambda},

where Y is a diagonal matrix whose i-th diagonal element is y_i, and Yα is a vector whose i-th element is α_i y_i.

The Dual Program Substituting in our expression for c, we are left with the following dual program:

\max_{\alpha \in R^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{4\lambda} \alpha^T Q \alpha
subject to:  \sum_{i=1}^{l} y_i \alpha_i = 0
             0 \le \alpha_i \le \frac{1}{l},  i = 1, \ldots, l

Here, Q is the matrix defined by

Q = Y K Y^T, \quad Q_{ij} = y_i y_j K(x_i, x_j).
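
In code, Q is just the elementwise product of the Gram matrix with the outer product of the labels (a NumPy sketch, assuming K and y are already available):

```python
import numpy as np

def dual_matrix(K, y):
    """Q = Y K Y^T with Y = diag(y), i.e. Q_ij = y_i * y_j * K(x_i, x_j)."""
    return np.outer(y, y) * K   # same as np.diag(y) @ K @ np.diag(y), but cheaper
```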

Standard Notation In most of the SVM literature, instead of the regularization parameter λ, regularization is controlled via a parameter C, defined using the relationship

C = \frac{1}{2 \lambda l}.

Using this definition (after multiplying our objective function by the constant 1/(2λ)), the basic regularization problem becomes

\min_{f \in H} \; C \sum_{i=1}^{l} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_K^2.

Like λ, the parameter C also controls the tradeoff between classification accuracy and the norm of the function. The primal and dual problems become...

The Reparametrized Problems

\min_{c \in R^l, \, b \in R, \, \xi \in R^l} \; C \sum_{i=1}^{l} \xi_i + \frac{1}{2} c^T K c
subject to:  y_i \left( \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l

\max_{\alpha \in R^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T Q \alpha
subject to:  \sum_{i=1}^{l} y_i \alpha_i = 0
             0 \le \alpha_i \le C,  i = 1, \ldots, l
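
For completeness, here is a sketch of solving the C-parameterized dual directly with a generic convex solver (again CVXPY, which is an assumption of mine rather than anything prescribed in the lecture):

```python
import numpy as np
import cvxpy as cp

def svm_dual(K, y, C):
    """max  sum(a) - 0.5 a'Qa   s.t.  y'a = 0,  0 <= a <= C,  where Q = YKY^T."""
    l = len(y)
    Q = np.outer(y, y) * K + 1e-8 * np.eye(l)   # ridge for numerical positive semidefiniteness
    a = cp.Variable(l)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q))
    constraints = [y @ a == 0, a >= 0, a <= C]
    cp.Problem(objective, constraints).solve()
    return a.value
```

A generic solver like this still forms the dense Q, so it runs into exactly the memory problem discussed later; it is fine for a few thousand points.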

The Geometric Approach The traditional approach to developing the mathematics of SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

Large and Small Margin Hyperplanes [Figure: (a) a separating hyperplane with a large margin; (b) a separating hyperplane with a small margin.]

Classification With Hyperplanes We denote our hyperplane by w, and we will classify a new point x via the function

f(x) = \operatorname{sign}(w \cdot x).  (1)

Given a separating hyperplane w, we let x be a datapoint closest to w, and we let x_w be the unique point on w that is closest to x. Obviously, finding a maximum-margin w is equivalent to maximizing \|x - x_w\|...

Deriving the Maximal Margin, I For some k (assume k > 0 for convenience),

w \cdot x = k
w \cdot x_w = 0
\implies w \cdot (x - x_w) = k

Deriving the Maximal Margin, II Noting that the vector x − x_w is parallel to the normal vector w,

w \cdot (x - x_w) = w \cdot \left( \|x - x_w\| \frac{w}{\|w\|} \right) = \|x - x_w\| \frac{\|w\|^2}{\|w\|} = \|w\| \, \|x - x_w\|

\|w\| \, \|x - x_w\| = w \cdot (x - x_w) = k
\implies \|x - x_w\| = \frac{k}{\|w\|}
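
A tiny numerical check of this derivation (a NumPy sketch with arbitrarily chosen numbers): projecting x onto the hyperplane w · x = 0 and measuring the distance recovers k/||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])      # hyperplane w.x = 0 with normal w
x = np.array([2.0, 1.0])      # a point with w.x = k = 10
k = w @ x

x_w = x - (k / (w @ w)) * w   # orthogonal projection of x onto the hyperplane
print(np.linalg.norm(x - x_w), k / np.linalg.norm(w))   # both print 2.0
```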

Deriving the Maximal Margin, III k is a nuisance parameter. WLOG, we fix k to 1 and see that maximizing \|x - x_w\| is equivalent to maximizing 1/\|w\|, which in turn is equivalent to minimizing \|w\| or \|w\|^2. We can now define the margin as the distance between the hyperplanes w · x = 0 and w · x = 1.

The Linear, Homogeneous, Separable SVM

\min_{w \in R^n} \; \|w\|^2
subject to:  y_i (w \cdot x_i) \ge 1,  i = 1, \ldots, l

Bias and Slack The SVM introduced by Vapnik includes an unregularized bias term b, leading to classification via a function of the form

f(x) = \operatorname{sign}(w \cdot x + b).

In practice, we want to work with datasets that are not linearly separable, so we introduce slacks ξ_i, just as before. We can still define the margin as the distance between the hyperplanes w · x = 0 and w · x = 1, but this is no longer particularly geometrically satisfying.

The New Primal With slack variables, the primal SVM problem becomes

\min_{w \in R^n, \, \xi \in R^l, \, b \in R} \; C \sum_{i=1}^{l} \xi_i + \frac{1}{2} \|w\|^2
subject to:  y_i (w \cdot x_i + b) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l

Historical Perspective Historically, most developments began with the geometric form, derived a dual program that was identical to the dual we derived above, and only then observed that the dual program required only dot products, and that these dot products could be replaced with a kernel function.

More Historical Perspective In the linearly separable case, we can also derive the separating hyperplane as the hyperplane whose normal is parallel to the vector connecting the closest two points in the positive and negative classes, passing through the midpoint of that vector (its perpendicular bisector). This was the Method of Portraits, derived by Vapnik in the 1970s, and recently rediscovered (with non-separable extensions) by Keerthi.

The Primal and Dual Problems Again

\min_{c \in R^l, \, b \in R, \, \xi \in R^l} \; C \sum_{i=1}^{l} \xi_i + \frac{1}{2} c^T K c
subject to:  y_i \left( \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right) \ge 1 - \xi_i,  i = 1, \ldots, l
             \xi_i \ge 0,  i = 1, \ldots, l

\max_{\alpha \in R^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T Q \alpha
subject to:  \sum_{i=1}^{l} y_i \alpha_i = 0
             0 \le \alpha_i \le C,  i = 1, \ldots, l

Optimal Solutions The primal and the dual are both feasible convex quadratic programs. Therefore, they both have optimal solutions, and optimal solutions to the primal and the dual have the same objective value.

The Reparametrized Lagrangian We derived the dual from the primal using the (now reparametrized) Lagrangian:

L(c, \xi, b, \alpha, \zeta) = C \sum_{i=1}^{l} \xi_i + \frac{1}{2} c^T K c
  - \sum_{i=1}^{l} \alpha_i \left( y_i \left\{ \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \right)
  - \sum_{i=1}^{l} \zeta_i \xi_i

The Reparametrized Dual, I

\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{l} \alpha_i y_i = 0

\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \zeta_i = 0 \implies 0 \le \alpha_i \le C

The reduced Lagrangian:

L_R(c, \alpha) = \frac{1}{2} c^T K c - \sum_{i=1}^{l} \alpha_i \left( y_i \sum_{j=1}^{l} c_j K(x_i, x_j) - 1 \right)

The relation between c and α:

\frac{\partial L_R}{\partial c} = 0 \implies c_i = \alpha_i y_i

Complementary Slackness The dual variables are associated with the primal constraints as follows:

\alpha_i : \quad y_i \left\{ \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \ge 0
\zeta_i : \quad \xi_i \ge 0

Complementary slackness tells us that at optimality, either the primal inequality is satisfied with equality or the dual variable is zero. In other words, if c, ξ, b, α and ζ are optimal solutions to the primal and dual, then

\alpha_i \left( y_i \left\{ \sum_{j=1}^{l} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \right) = 0
\zeta_i \xi_i = 0

Optimality Conditions All optimal solutions must satisfy:

\sum_{j=1}^{l} c_j K(x_i, x_j) - \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j) = 0,  i = 1, \ldots, l
\sum_{i=1}^{l} \alpha_i y_i = 0
C - \alpha_i - \zeta_i = 0,  i = 1, \ldots, l
y_i \left( \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j) + b \right) - 1 + \xi_i \ge 0,  i = 1, \ldots, l
\alpha_i \left[ y_i \left( \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j) + b \right) - 1 + \xi_i \right] = 0,  i = 1, \ldots, l
\zeta_i \xi_i = 0,  i = 1, \ldots, l
\xi_i, \alpha_i, \zeta_i \ge 0,  i = 1, \ldots, l

Optimality Conditions, II The optimality conditions are both necessary and sufficient. If we have c, ξ, b, α and ζ satisfying the above conditions, we know that they represent optimal solutions to the primal and dual problems. These optimality conditions are also known as the Karush-Kuhn-Tucker (KKT) conditions.

Toward Simpler Optimality Conditions: Determining b Suppose we have the optimal α_i's. Also suppose (this happens in practice) that there exists an i satisfying 0 < α_i < C. Then

\alpha_i < C \implies \zeta_i > 0 \implies \xi_i = 0
\implies y_i \left( \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j) + b \right) - 1 = 0
\implies b = y_i - \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j)

So if we know the optimal α's, we can determine b.
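
A sketch of this computation (NumPy, assuming the optimal α's are already in hand). Averaging over all margin support vectors rather than picking a single i is a common numerical stabilization, not something stated in the slides:

```python
import numpy as np

def compute_bias(alpha, y, K, C, tol=1e-6):
    """b = y_i - sum_j y_j alpha_j K(x_i, x_j) for any i with 0 < alpha_i < C.
    Assumes at least one such margin support vector exists."""
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]
    f_no_bias = K @ (alpha * y)      # sum_j y_j alpha_j K(x_i, x_j), for all i at once
    return np.mean(y[margin_sv] - f_no_bias[margin_sv])
```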

Towards Simpler Optimality Conditions, I Defining our classification function f(x) as

f(x) = \sum_{i=1}^{l} y_i \alpha_i K(x, x_i) + b,

we can derive reduced optimality conditions. For example, consider an i such that y_i f(x_i) < 1:

y_i f(x_i) < 1 \implies \xi_i > 0 \implies \zeta_i = 0 \implies \alpha_i = C

Towards Simpler Optimality Conditions, II Conversely, suppose α_i = C:

\alpha_i = C \implies y_i f(x_i) - 1 + \xi_i = 0 \implies y_i f(x_i) \le 1

Reduced Optimality Conditions Proceeding similarly, we can write the following reduced optimality conditions (full proof: homework):

\alpha_i = 0 \implies y_i f(x_i) \ge 1
0 < \alpha_i < C \implies y_i f(x_i) = 1
\alpha_i = C \implies y_i f(x_i) \le 1

y_i f(x_i) > 1 \implies \alpha_i = 0
y_i f(x_i) < 1 \implies \alpha_i = C
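
These conditions translate directly into a verification routine. The sketch below (my own, with an arbitrary tolerance) flags the indices that violate them, which is exactly the information a working-set selection heuristic needs:

```python
import numpy as np

def reduced_kkt_violations(alpha, b, y, K, C, tol=1e-3):
    """Indices i whose (alpha_i, y_i f(x_i)) pair violates the reduced conditions."""
    yf = y * (K @ (alpha * y) + b)                     # y_i * f(x_i)
    viol  = (alpha < tol) & (yf < 1 - tol)             # alpha_i = 0  needs y_i f(x_i) >= 1
    viol |= (alpha > C - tol) & (yf > 1 + tol)         # alpha_i = C  needs y_i f(x_i) <= 1
    viol |= (alpha > tol) & (alpha < C - tol) & (np.abs(yf - 1) > tol)   # interior needs = 1
    return np.where(viol)[0]
```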

Geometric Interpretation of Reduced Optimality Conditions

SVM Training Our plan will be to solve the dual problem to find the α's, and use that to find b and our function f. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint; even better, we will see that the problem can be decomposed into a sequence of smaller problems.

Off-the-shelf QP software We can solve QPs using standard software. Many codes are available. The main problem is that the Q matrix is dense and is l-by-l, so for large datasets we cannot even write it down. Standard QP software requires the Q matrix explicitly, so it is not suitable for large problems.
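
A back-of-the-envelope calculation makes the point (the dataset sizes below are illustrative):

```python
# Dense Q has l*l entries of 8 bytes each.
for l in (10_000, 100_000, 1_000_000):
    gib = l * l * 8 / 2**30
    print(f"l = {l:>9,}: Q needs about {gib:,.0f} GiB")
# l =    10,000: Q needs about 1 GiB
# l =   100,000: Q needs about 75 GiB
# l = 1,000,000: Q needs about 7,451 GiB
```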

Decomposition, I Partition the dataset into a working set W and the remaining points R. We can rewrite the dual problem as:

\max_{\alpha_W \in R^{|W|}, \, \alpha_R \in R^{|R|}} \; \sum_{i \in W} \alpha_i + \sum_{i \in R} \alpha_i - \frac{1}{2} \begin{bmatrix} \alpha_W^T & \alpha_R^T \end{bmatrix} \begin{bmatrix} Q_{WW} & Q_{WR} \\ Q_{RW} & Q_{RR} \end{bmatrix} \begin{bmatrix} \alpha_W \\ \alpha_R \end{bmatrix}
subject to:  \sum_{i \in W} y_i \alpha_i + \sum_{i \in R} y_i \alpha_i = 0
             0 \le \alpha_i \le C, \; \forall i

Decomposition, II Suppose we have a feasible solution α. We can get a better solution by treating α_W as variables and α_R as constants. We solve the reduced dual problem:

\max_{\alpha_W \in R^{|W|}} \; (1 - Q_{WR} \alpha_R)^T \alpha_W - \frac{1}{2} \alpha_W^T Q_{WW} \alpha_W
subject to:  \sum_{i \in W} y_i \alpha_i = - \sum_{i \in R} y_i \alpha_i
             0 \le \alpha_i \le C, \; \forall i \in W
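
Assembling the data of this reduced subproblem is a few lines of indexing (a NumPy sketch; solving the small QP itself can then be handed to any standard solver):

```python
import numpy as np

def reduced_dual_data(Q, y, alpha, W):
    """Return (Q_WW, linear term 1 - Q_WR a_R, equality-constraint RHS -sum_R y_i a_i)."""
    R = np.setdiff1d(np.arange(len(y)), W)
    Q_WW = Q[np.ix_(W, W)]
    lin = 1.0 - Q[np.ix_(W, R)] @ alpha[R]
    eq_rhs = -np.sum(y[R] * alpha[R])
    return Q_WW, lin, eq_rhs
```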

Decomposition, III The reduced problems are fixed size, and can be solved using a standard QP code. Convergence proofs are difficult, but this approach seems to always converge to an optimal solution in practice.

Selecting the Working Set There are many different approaches. The basic idea is to examine points not in the working set, find points which violate the reduced optimality conditions, and add them to the working set. Remove points which are in the working set but are far from violating the optimality conditions.

Good Large-Scale Solvers
SVM Light: http://svmlight.joachims.org
SVM Torch: http://www.torch.ch
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
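
As a practical aside (not from the original lecture), libsvm is also the solver behind scikit-learn's SVC, which is often the most convenient way to use it today:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps libsvm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels, not linearly separable

clf = SVC(C=1.0, kernel="rbf", gamma=1.0)    # C plays the role described earlier
clf.fit(X, y)
print(clf.n_support_, clf.score(X, y))
```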