Polyhedral Computation. Linear Classifiers & the SVM


Polyhedral Computation: Linear Classifiers & the SVM
mcuturi@i.kyoto-u.ac.jp
Nov 26, 2010

Statistical Inference
Statistical: useful to study random systems (mutations, environmental changes, etc.): life is random!
Inference: learn rules using observations, assuming some stationarity.
In this talk: yes/no rules, i.e. binary classification.
Given an image, does it contain the photograph of a human face?
Given a patient's genome, is it safe/effective to give him medicine XYZ?
Given a patient's genome, is (s)he at risk of developing Parkinson's disease?
etc.
Binary classification: simple 0/1 predictions for well-understood problems.

Data
The data we have: a bunch of vectors $x_1, x_2, x_3, \dots, x_N$.
Ideally, to infer a yes/no rule, we need the correct answer for each vector. We thus consider a set of vector/bit pairs:
training set $= \left\{ \left( x_i = (x_i^1, x_i^2, \dots, x_i^d)^T \in \mathbb{R}^d,\ y_i \in \{0,1\} \right) \right\}_{i=1..N}$
For illustration purposes only we will consider vectors in the plane, $d = 2$: points are easier to represent in 2 dimensions than in 20,000. The ideas for $d \geq 3$ are exactly the same.
Many thanks to J.P. Vert for some of the following slides.

Classification: Separation Surfaces for Vectors
What is a classification rule?
A classification rule = a partition of $\mathbb{R}^d$ into two sets.
It can be defined by a single surface, e.g. a curved line.
Even simpler: using straight lines and halfspaces.

Linear Classifiers
Straight lines (hyperplanes when $d > 2$) are the simplest type of classifiers.
A hyperplane $H_{c,b}$ is a set in $\mathbb{R}^d$ defined by a normal vector $c \in \mathbb{R}^d$ and a constant $b \in \mathbb{R}$, as
$H_{c,b} = \{ x \in \mathbb{R}^d \mid c^T x = b \}.$
Letting $b$ vary, we can slide the hyperplane across $\mathbb{R}^d$ (figure: the parallel hyperplanes $H_{c,b}$ and $H_{c,0}$ with normal vector $c$).

Linear Classifiers
Exactly like lines in the plane, hyperplanes divide $\mathbb{R}^d$ into two halfspaces,
$\{ x \in \mathbb{R}^d \mid c^T x < b \} \cup \{ x \in \mathbb{R}^d \mid c^T x \geq b \} = \mathbb{R}^d.$
Linear classifiers attribute the yes and no answers given arbitrary $c$ and $b$ (figure: the hyperplane $H_{c,b}$ with the NO halfspace on one side and the YES halfspace on the other).
Assuming we only look at halfspaces for the decision surface, how to choose the best $(c^\star, b^\star)$ given a training sample?
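As a concrete illustration (not from the slides), here is a minimal numpy sketch of such a halfspace classifier; the normal vector c and constant b below are hypothetical values chosen only for the example:

import numpy as np

c = np.array([1.0, -2.0])   # hypothetical normal vector
b = 0.5                     # hypothetical constant

def classify(x):
    # answer YES on the halfspace c^T x >= b, NO on c^T x < b
    return "YES" if c @ x >= b else "NO"

print(classify(np.array([3.0, 0.0])))   # c^T x = 3.0 >= 0.5  -> YES
print(classify(np.array([0.0, 1.0])))   # c^T x = -2.0 < 0.5  -> NO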

Linear Classifiers
This specific question,
training set $\{ (x_i \in \mathbb{R}^d,\ y_i \in \{0,1\}) \}_{i=1..n} \ \longrightarrow\ ???\ \longrightarrow\ $ best $(c^\star, b^\star)$,
has different answers, depending on the meaning of "best" [4]:
Linear Discriminant Analysis (or Fisher's Linear Discriminant);
Logistic regression (maximum likelihood estimation);
Perceptron, a one-layer neural network;
etc.
Today's focus: the Support Vector Machine [5].

Classification: Separation Surfaces for Vectors (illustrative figures)

Linear classifier, some degrees of freedom (illustrative figures)

Which one is better?

A criterion to select a linear classifier: the margin [2] (illustrative figures)

Largest Margin Linear Classifier [2] (figure)

Support Vectors with Large Margin (figure)

In equations
We assume (for the moment) that the data are linearly separable, i.e., that there exists $(w, b) \in \mathbb{R}^d \times \mathbb{R}$ such that (from now on the labels are encoded as $y_i \in \{-1, +1\}$):
$w^T x_i + b > 0$ if $y_i = 1$,
$w^T x_i + b < 0$ if $y_i = -1$.
Next, we give a formula to compute the margin as a function of $w$.
Obviously, for any $t > 0$, $H_{w,b} = H_{tw,tb}$: $w$ and $b$ are defined up to a multiplicative constant. We need to take care of this in the definition of the margin.

How to find the largest separating hyperplane?
For the linear classifier $f(x) = w^T x + b$, consider the interstice defined by the hyperplanes
$f(x) = w^T x + b = +1$ and $f(x) = w^T x + b = -1$.
(Figure: the hyperplanes $w \cdot x + b = 0$, $w \cdot x + b = +1$, $w \cdot x + b = -1$, the regions $w \cdot x + b > +1$ and $w \cdot x + b < -1$, the normal vector $w$, and two points $x_1$, $x_2$.)
Consider $x_1$ and $x_2$ such that $x_2 - x_1$ is parallel to $w$.

The margin is $2/\|w\|$
Margin $= 2/\|w\|$: the points $x_1$ and $x_2$ satisfy
$w^T x_1 + b = 0,$
$w^T x_2 + b = 1.$
By subtracting we get $w^T (x_2 - x_1) = 1$; since $x_2 - x_1$ is parallel to $w$, this gives $\|x_2 - x_1\| = 1/\|w\|$, and therefore
$\gamma \stackrel{\text{def}}{=} 2\,\|x_2 - x_1\| = \frac{2}{\|w\|},$
where $\gamma$ is by definition the margin.

All training points should be on the appropriate side
For positive examples ($y_i = 1$) this means: $w^T x_i + b \geq 1$.
For negative examples ($y_i = -1$) this means: $w^T x_i + b \leq -1$.
In both cases: $\forall i = 1, \dots, n,\quad y_i (w^T x_i + b) \geq 1$.

Finding the optimal hyperplane
Find $(w, b)$ which minimize
$\|w\|^2$
under the constraints
$\forall i = 1, \dots, n,\quad y_i (w^T x_i + b) - 1 \geq 0.$
This is a classical quadratic program on $\mathbb{R}^{d+1}$: linear constraints, quadratic objective.
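A minimal sketch of this primal QP using the cvxopt solver (assuming cvxopt is installed; the toy data below is hypothetical and linearly separable). The variable is $z = (w, b) \in \mathbb{R}^{d+1}$, and the code minimizes $\frac{1}{2}\|w\|^2$, which has the same minimizer as $\|w\|^2$:

import numpy as np
from cvxopt import matrix, solvers

# hypothetical separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape

P = np.zeros((d + 1, d + 1)); P[:d, :d] = np.eye(d)   # objective (1/2) w^T w, b unpenalized
q = np.zeros(d + 1)
G = -np.hstack([X * y[:, None], y[:, None]])          # y_i (w^T x_i + b) >= 1  <=>  -y_i(w^T x_i + b) <= -1
h = -np.ones(n)

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol['x']).ravel()
w, b = z[:d], z[d]
print(w, b, (y * (X @ w + b) >= 1 - 1e-6).all())      # all margin constraints satisfied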

Lagrangian
In order to minimize
$\frac{1}{2}\|w\|^2$
under the constraints
$\forall i = 1, \dots, n,\quad y_i (w^T x_i + b) - 1 \geq 0,$
introduce one dual variable $\alpha_i$ for each constraint, i.e. one for each training point. The Lagrangian is, for $\alpha \geq 0$ (that is, for each $\alpha_i \geq 0$),
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right).$

The Lagrange dual function
$g(\alpha) = \inf_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right) \right\}$
is only defined when
$w = \sum_{i=1}^{n} \alpha_i y_i x_i$ (differentiating w.r.t. $w$)  $(\ast)$
$0 = \sum_{i=1}^{n} \alpha_i y_i$ (differentiating w.r.t. $b$)  $(\ast\ast)$
Substituting $(\ast)$ in $g$, and using $(\ast\ast)$ as a constraint, we get the dual function $g(\alpha)$.
To solve the dual problem, maximize $g$ w.r.t. $\alpha$. Strong duality holds.
The KKT conditions give us $\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0$, hence either $\alpha_i = 0$ or $y_i (w^T x_i + b) = 1$:
$\alpha_i > 0$ only for points on the support hyperplanes $\{ (x_i, y_i) \mid y_i (w^T x_i + b) = 1 \}$.

Dual optimum
The dual problem is thus
maximize $g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$
such that $\alpha \geq 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0.$
This is a quadratic program in $\mathbb{R}^n$, with box constraints. $\alpha^\star$ can be computed using optimization software (e.g. a built-in MATLAB function).
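For instance, here is a minimal sketch of this dual QP with cvxopt (cvxopt assumed installed; the toy data is the same hypothetical separable set as in the primal sketch above). Since solvers.qp minimizes $\frac{1}{2}\alpha^T P \alpha + q^T \alpha$, we negate $g$:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])   # hypothetical toy data (as above)
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                                        # Gram matrix of inner products x_i^T x_j
P = matrix(np.outer(y, y) * K)                     # P_ij = y_i y_j x_i^T x_j
q = matrix(-np.ones(n))                            # maximize sum(alpha)  <=>  minimize -sum(alpha)
G = matrix(-np.eye(n)); h = matrix(np.zeros(n))    # alpha_i >= 0
A = matrix(y.reshape(1, -1)); b0 = matrix(0.0)     # sum_i alpha_i y_i = 0

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b0)
alpha = np.array(sol['x']).ravel()                 # optimal dual variables alpha*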

Recovering the optimal hyperplane
With $\alpha^\star$, we recover $(w^\star, b^\star)$ corresponding to the optimal hyperplane.
$w^\star$ is given by $w^{\star T} = \sum_{i=1}^{n} y_i \alpha_i^\star x_i^T$.
$b^\star$ is given by the conditions on the support vectors ($\alpha_i > 0$, $y_i (w^{\star T} x_i + b^\star) = 1$):
$b^\star = -\frac{1}{2} \left( \min_{y_i = 1,\, \alpha_i > 0} (w^{\star T} x_i) + \max_{y_i = -1,\, \alpha_i > 0} (w^{\star T} x_i) \right).$
The decision function is therefore
$f^\star(x) = w^{\star T} x + b^\star = \sum_{i=1}^{n} y_i \alpha_i^\star\, x_i^T x + b^\star.$
Here the dual solution gives us directly the primal solution.
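Continuing the dual sketch above (alpha, X, y already computed there), the recovery could look like this; the 1e-6 threshold used to detect support vectors is an arbitrary numerical tolerance:

w = (alpha * y) @ X                                  # w* = sum_i y_i alpha_i x_i
sv = alpha > 1e-6                                    # support vectors: alpha_i > 0
b = -0.5 * ((X[sv & (y == 1)] @ w).min() +
            (X[sv & (y == -1)] @ w).max())           # b* from the support-vector conditions
decision = lambda x: np.sign(w @ x + b)              # f*(x) = sign(w*^T x + b*)
print(w, b, [decision(x) for x in X])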

Interpretation: support vectors (figure: points with $\alpha = 0$ and support vectors with $\alpha > 0$)

Another interpretation: Convex Hulls [1]
Go back to two sets of points that are linearly separable.
Linearly separable = the convex hulls do not intersect.
Find the two closest points, one in each convex hull.
The SVM = the hyperplane that perpendicularly bisects that segment.
Support vectors = extreme points of the faces on which the two closest points lie.

A brief proof through duality
Suppose that all the points of the blue set are the columns of a matrix $A \in \mathbb{R}^{d \times n_1}$, and all the points of the red set are the columns of a matrix $B \in \mathbb{R}^{d \times n_2}$.
Finding the two points in question, and the minimal distance, is given by
minimize $\|Au - Bv\|^2$
subject to $\mathbf{1}_{n_1}^T u = \mathbf{1}_{n_2}^T v = 1,\quad u \geq 0,\ v \geq 0,\quad u \in \mathbb{R}^{n_1},\ v \in \mathbb{R}^{n_2}.$
It is possible to prove that the primal SVM program, slightly modified, has this dual. A bit tedious unfortunately.
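A small numerical sketch of this closest-points problem, solved here with scipy's general-purpose SLSQP routine rather than a dedicated QP code; P and Q are hypothetical point sets stored with one point per row, so that $A = P^T$ and $B = Q^T$:

import numpy as np
from scipy.optimize import minimize

P = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0]])      # blue points (rows)
Q = np.array([[-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # red points (rows)
n1, n2 = len(P), len(Q)

def objective(z):
    u, v = z[:n1], z[n1:]
    diff = u @ P - v @ Q                                 # Au - Bv with A = P^T, B = Q^T
    return diff @ diff

constraints = [{'type': 'eq', 'fun': lambda z: z[:n1].sum() - 1},   # 1^T u = 1
               {'type': 'eq', 'fun': lambda z: z[n1:].sum() - 1}]   # 1^T v = 1
bounds = [(0, None)] * (n1 + n2)                                    # u, v >= 0
z0 = np.concatenate([np.full(n1, 1 / n1), np.full(n2, 1 / n2)])     # start at the barycenters

res = minimize(objective, z0, method='SLSQP', bounds=bounds, constraints=constraints)
p, q = res.x[:n1] @ P, res.x[n1:] @ Q        # closest points, one in each convex hull
print(p, q, np.linalg.norm(p - q))           # the SVM hyperplane bisects the segment [p, q]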

What happens when the data is not linearly separable? (illustrative figures)

Soft-margin SVM [3]
Find a trade-off between a large margin and few errors. Mathematically:
$\min_{f} \left\{ \frac{1}{\text{margin}(f)} + C \cdot \text{errors}(f) \right\},$
where $C$ is a parameter.

Soft-margin SVM formulation [3]
The margin of a labeled point $(x, y)$ is
$\text{margin}(x, y) = y (w^T x + b).$
The error is 0 if $\text{margin}(x, y) \geq 1$, and $1 - \text{margin}(x, y)$ otherwise.
The soft-margin SVM solves:
$\min_{w, b} \left\{ \|w\|^2 + C \sum_{i=1}^{n} \max\{0,\ 1 - y_i (w^T x_i + b)\} \right\}$
$c(u, y) = \max\{0, 1 - yu\}$ is known as the hinge loss. $c(w^T x_i + b, y_i)$ associates a mistake cost to the decision $(w, b)$ for example $x_i$.
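Because this objective is an unconstrained sum of a quadratic term and hinge losses, it can also be minimized directly; here is a minimal subgradient-descent sketch in numpy (a sketch only: the step size, iteration count and the expected label encoding $y_i \in \{-1,+1\}$ are choices made for this example, and practical solvers use the QP dual instead):

import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=1e-3, iters=5000):
    """Subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                                   # points with a nonzero hinge loss
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: w, b = soft_margin_svm(X, y, C=10.0) for any (n, d) data matrix X and labels y in {-1, +1}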

Dual formulation of the soft-margin SVM
The soft-margin SVM program
$\min_{w, b} \left\{ \|w\|^2 + C \sum_{i=1}^{n} \max\{0,\ 1 - y_i (w^T x_i + b)\} \right\}$
can be rewritten as
minimize $\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
such that $y_i (w^T x_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0.$
In that case the dual function is
$g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j,$
which is finite under the constraints
$0 \leq \alpha_i \leq C$ for $i = 1, \dots, n$, and $\sum_{i=1}^{n} \alpha_i y_i = 0.$
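In the cvxopt dual sketch given earlier, this box constraint is the only change: keeping P, q, A, b0 and n from that sketch, the soft-margin dual could be solved with something like

C = 1.0                                                # hypothetical trade-off parameter
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))         # stack  -alpha_i <= 0  and  alpha_i <= C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
sol = solvers.qp(P, q, G, h, A, b0)                    # same objective, box constraints 0 <= alpha_i <= C
alpha = np.array(sol['x']).ravel()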

Interpretation: bounded and unbounded support vectors (figure: points with $\alpha = 0$, bounded support vectors with $\alpha = C$, and unbounded support vectors with $0 < \alpha < C$)

What about the convex hull analogy?
Remember the separable case. Here we consider the case where the two sets are not linearly separable, i.e. their convex hulls intersect.

What about the convex hull analogy?
To follow with the convex hull analogy, consider instead the reduced convex hull.
Definition 1. Given a set of $n$ points $A = \{x_1, \dots, x_n\}$ and $0 \leq C \leq 1$, the set of finite combinations
$\sum_{i=1}^{n} \lambda_i x_i,\quad 0 \leq \lambda_i \leq C,\quad \sum_{i=1}^{n} \lambda_i = 1,$
is the ($C$-)reduced convex hull of $A$.
Using $C = 1/2$: (figure of the reduced convex hulls of $A$ and $B$).

Closest Points in the Reduced Convex Hull
Idea: find the closest points of the two reduced convex hulls,
minimize $\|Au - Bv\|^2$
subject to $\mathbf{1}^T u = \mathbf{1}^T v = 1,\quad 0 \leq u \leq C\mathbf{1},\quad 0 \leq v \leq C\mathbf{1},\quad u \in \mathbb{R}^{n_1},\ v \in \mathbb{R}^{n_2}.$
Again, one can prove that the soft-margin SVM with constant $C$ accepts the formulation above as a dual.

Sometimes linear classifiers are of little use (figure)

Solution: non-linear mapping to a feature space
(Figure: the mapping $(x_1, x_2) \mapsto (x_1^2, x_2^2)$, with a circle of radius $R$.)
Let $\varphi(x) = (x_1^2, x_2^2)$, $w = (1, 1)$ and $b = -R^2$. Then the decision function is
$f(x) = x_1^2 + x_2^2 - R^2 = \langle w, \varphi(x) \rangle + b.$

Kernel trick for SVMs [3]
Use a mapping $\varphi$ from $\mathcal{X}$ to a feature space, which corresponds to the kernel $k$:
$\forall x, x' \in \mathcal{X},\quad k(x, x') = \langle \varphi(x), \varphi(x') \rangle.$
Example: if $\varphi(x) = \varphi\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1^2 \\ x_2^2 \end{bmatrix}$, then
$k(x, x') = \langle \varphi(x), \varphi(x') \rangle = (x_1)^2 (x_1')^2 + (x_2)^2 (x_2')^2.$
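This correspondence is easy to check numerically; a minimal sketch with arbitrary test vectors:

import numpy as np

phi = lambda v: np.array([v[0] ** 2, v[1] ** 2])                   # feature map (x1^2, x2^2)
k = lambda x, xp: x[0] ** 2 * xp[0] ** 2 + x[1] ** 2 * xp[1] ** 2  # the corresponding kernel
x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])                # arbitrary points
assert np.isclose(phi(x) @ phi(xp), k(x, xp))                      # <phi(x), phi(x')> == k(x, x')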

Training an SVM in the feature space
Replace each $x^T x'$ in the SVM algorithm by $\langle \varphi(x), \varphi(x') \rangle = k(x, x')$.
Reminder: the dual problem is to maximize
$g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j),$
under the constraints
$0 \leq \alpha_i \leq C$ for $i = 1, \dots, n$, and $\sum_{i=1}^{n} \alpha_i y_i = 0.$
The decision function becomes
$f(x) = \langle w, \varphi(x) \rangle + b = \sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b. \quad (1)$
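This is also how kernel SVM solvers are typically used in practice. As an illustration, here is a sketch with scikit-learn's SVC and a precomputed Gram matrix (scikit-learn assumed available; the random data and the degree-2 polynomial kernel are illustrative choices, not taken from the slides):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary: not linearly separable

K = (X @ X.T) ** 2                                       # Gram matrix for k(x, x') = (x^T x')^2
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)         # the solver only ever sees K and y

X_new = rng.randn(5, 2)
K_new = (X_new @ X.T) ** 2                               # kernel between new points and training points
print(clf.predict(K_new), len(clf.support_))             # predictions and number of support vectors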

The Kernel Trick [5]
The explicit computation of $\varphi(x)$ is not necessary: the kernel $k(x, x')$ is enough.
The SVM optimization for $\alpha$ works implicitly in the feature space.
The SVM is a kernel algorithm: we only need to input $K$ and $y$:
maximize $g(\alpha) = \alpha^T \mathbf{1} - \frac{1}{2} \alpha^T \left( (y y^T) \circ K \right) \alpha$
such that $0 \leq \alpha_i \leq C$ for $i = 1, \dots, n$, and $\sum_{i=1}^{n} \alpha_i y_i = 0.$
$K$ positive definite $\Rightarrow$ the problem has an optimum.
The decision function is $f(\cdot) = \sum_{i=1}^{n} y_i \alpha_i\, k(x_i, \cdot) + b$.

Kernel example: polynomial kernel
For $x = (x_1, x_2) \in \mathbb{R}^2$, let $\varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3$. Then
$K(x, x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^T x')^2.$
(Figure: the $(x_1, x_2) \mapsto (x_1^2, x_2^2)$ picture as before.)
Many more kernels exist.
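The identity can again be verified numerically with arbitrary test vectors:

import numpy as np

phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])
x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2)       # <phi(x), phi(x')> == (x^T x')^2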

Kernels are Trojan Horses onto Linear Models
With kernels, complex structures can enter the realm of linear models.

References
[1] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, page 64. Morgan Kaufmann Publishers Inc., 2000.
[2] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). Springer Verlag, 2009.
[5] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.