Statistical Methods for SVM

Support Vector Machines

Here we approach the two-class classification problem in a direct way: we try to find a plane that separates the classes in feature space. If we cannot, we get creative in two ways: we soften what we mean by "separates", and we enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?

A hyperplane in p dimensions is a flat affine subspace of dimension p − 1. In general the equation for a hyperplane has the form β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p = 0. In p = 2 dimensions a hyperplane is a line. If β_0 = 0, the hyperplane goes through the origin, otherwise not. The vector β = (β_1, β_2, ..., β_p) is called the normal vector: it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions

[Figure: a hyperplane in two dimensions with normal vector β = (β_1, β_2) = (0.8, 0.6); the lines β_1 X_1 + β_2 X_2 − 6 = 0, = 1.6, and = −4 are marked.]

Separating Hyperplanes

[Figure: two panels of two-class points in (X_1, X_2) with separating hyperplanes.]

If f(X) = β_0 + β_1 X_1 + ... + β_p X_p, then f(X) > 0 for points on one side of the hyperplane, and f(X) < 0 for points on the other. If we code the colored points as Y_i = +1 for blue, say, and Y_i = −1 for mauve, then if Y_i f(x_i) > 0 for all i, f(X) = 0 defines a separating hyperplane.
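
A small R sketch of this sign check, with made-up points and a hypothetical candidate hyperplane (both are illustrative, not from the slides):

    # Hypothetical data: 10 points per class, roughly centred at (2, 2) and (-2, -2)
    set.seed(1)
    x <- rbind(matrix(rnorm(20, mean = 2), ncol = 2),
               matrix(rnorm(20, mean = -2), ncol = 2))
    y <- c(rep(1, 10), rep(-1, 10))

    beta0 <- 0
    beta  <- c(1, 1)                # a candidate normal vector
    f     <- beta0 + x %*% beta     # f(x) for every point

    all(y * f > 0)                  # TRUE means the candidate hyperplane separates the classes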

Maximal Margin Classifier

Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.

[Figure: separable two-class data with the maximal margin hyperplane and its margin boundaries.]

Constrained optimization problem:

maximize over β_0, β_1, ..., β_p: M
subject to Σ_{j=1}^p β_j² = 1,
y_i(β_0 + β_1 x_{i1} + ... + β_p x_{ip}) ≥ M for all i = 1, ..., N.

This can be rephrased as a convex quadratic program, and solved efficiently. The function svm() in package e1071 solves this problem efficiently.
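
A minimal sketch of that in practice, assuming the e1071 package is installed and using simulated separable data (the data and the very large cost value are illustrative; a huge cost of violations approximates the hard-margin, maximal margin classifier):

    library(e1071)

    set.seed(1)
    x <- rbind(matrix(rnorm(20, mean = 2), ncol = 2),
               matrix(rnorm(20, mean = -2), ncol = 2))
    y <- factor(c(rep(1, 10), rep(-1, 10)))

    fit <- svm(x, y, kernel = "linear", cost = 1e5, scale = FALSE)  # near hard-margin fit
    fit$index                  # indices of the support vectors
    table(predict(fit, x), y)  # training confusion matrix (perfect if the simulated classes separate)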

Maximal margin classifier

We showed that this problem can be more conveniently rephrased as

min over β, β_0: ‖β‖  subject to  y_i(x_i^T β + β_0) ≥ 1, i = 1, ..., N.

This is a convex optimization problem (quadratic criterion, linear inequality constraints).

Non-separable Data

[Figure: two-class data in (X_1, X_2) that cannot be separated by a linear boundary.]

The data on the left are not separable by a linear boundary. This is often the case, unless N < p.

Noisy Data

[Figure: two panels of separable but noisy two-class data and the resulting maximal margin solutions.]

Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier. The support vector classifier maximizes a soft margin.

Support Vector Classifier

[Figure: two panels of two-class data with the margin boundaries and observations numbered 1-12.]

maximize over β_0, β_1, ..., β_p, ε_1, ..., ε_n: M
subject to Σ_{j=1}^p β_j² = 1,
y_i(β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_p x_{ip}) ≥ M(1 − ε_i),
ε_i ≥ 0, Σ_{i=1}^n ε_i ≤ C.
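
In e1071 the soft margin is controlled by a cost parameter that penalizes margin violations, which behaves inversely to the budget C above (a small cost corresponds roughly to a large budget). A sketch on simulated overlapping data; the cost values are arbitrary:

    library(e1071)

    set.seed(2)
    x <- matrix(rnorm(40 * 2), ncol = 2)
    y <- c(rep(1, 20), rep(-1, 20))
    x[y == 1, ] <- x[y == 1, ] + 1      # overlapping classes, so some slack is needed
    y <- factor(y)

    fit_small <- svm(x, y, kernel = "linear", cost = 0.1)  # small cost: wide margin, more support vectors
    fit_large <- svm(x, y, kernel = "linear", cost = 100)  # large cost: narrow margin, fewer support vectors
    c(small = length(fit_small$index), large = length(fit_large$index))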

C is a regularization parameter

[Figure: four panels showing the support vector classifier fit to the same data for different values of C.]
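
In practice the cost parameter is chosen by cross-validation; a sketch using e1071's tune() helper on simulated data (the grid of cost values is illustrative):

    library(e1071)

    set.seed(2)
    x <- matrix(rnorm(40 * 2), ncol = 2)
    y <- c(rep(1, 20), rep(-1, 20))
    x[y == 1, ] <- x[y == 1, ] + 1
    y <- factor(y)

    # 10-fold cross-validation over a grid of cost values
    cv <- tune(svm, train.x = x, train.y = y, kernel = "linear",
               ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
    summary(cv)
    cv$best.parameters    # cost value with the lowest cross-validated error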

Support vector classifier

We can drop the norm constraint on β, define M = 1/‖β‖, and arrive at the equivalent form

min over β, β_0: ‖β‖  subject to  y_i(x_i^T β + β_0) ≥ 1 − ε_i, ε_i ≥ 0, Σ_{i=1}^N ε_i ≤ constant.

This is the usual way the support vector classifier is defined for the non-separable case. Points well inside their class boundary do not play a big role in shaping the boundary.

Computing the Support Vector Classifier

Linear boundary can fail

[Figure: two-class data in (X_1, X_2) arranged so that no linear boundary separates the classes well.]

Sometimes a linear boundary simply won't work, no matter what value of C. The example on the left is such a case. What to do?

Feature Expansion

Enlarge the space of features by including transformations; e.g. X_1², X_1³, X_1 X_2, X_1 X_2², ... Hence go from a p-dimensional space to an M > p dimensional space. Fit a support-vector classifier in the enlarged space. This results in non-linear decision boundaries in the original space.

Example: Suppose we use (X_1, X_2, X_1², X_2², X_1 X_2) instead of just (X_1, X_2). Then the decision boundary would be of the form

β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1² + β_4 X_2² + β_5 X_1 X_2 = 0.

This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
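
A sketch of that quadratic expansion, built by hand and passed to a linear support-vector classifier (the simulated data with a circular true boundary are illustrative):

    library(e1071)

    set.seed(3)
    x <- matrix(rnorm(100 * 2), ncol = 2)
    y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # a nonlinear (circular) boundary

    # Enlarged feature space: (X1, X2, X1^2, X2^2, X1*X2)
    x_expanded <- cbind(x, x[, 1]^2, x[, 2]^2, x[, 1] * x[, 2])

    fit <- svm(x_expanded, y, kernel = "linear", cost = 1)
    mean(predict(fit, x_expanded) == y)   # training accuracy; the boundary is quadratic in (X1, X2)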

Cubic Polynomials

Here we use a basis expansion of cubic polynomials: from 2 variables to 9. The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space.

[Figure: two panels showing the data in (X_1, X_2) and the resulting nonlinear decision boundary.]

β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1² + β_4 X_2² + β_5 X_1 X_2 + β_6 X_1³ + β_7 X_2³ + β_8 X_1 X_2² + β_9 X_1² X_2 = 0

Nonlinearities and Kernels

Polynomials (especially high-dimensional ones) get wild rather fast. There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers, through the use of kernels. Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors

The inner product between vectors is ⟨x_i, x_i'⟩ = Σ_{j=1}^p x_{ij} x_{i'j}.

The linear support vector classifier can be represented as

f(x) = β_0 + Σ_{i=1}^n α_i ⟨x, x_i⟩   (n parameters)

To estimate the parameters α_1, ..., α_n and β_0, all we need are the (n choose 2) inner products ⟨x_i, x_i'⟩ between all pairs of training observations.

It turns out that most of the α̂_i can be zero:

f(x) = β_0 + Σ_{i∈S} α̂_i ⟨x, x_i⟩

S is the support set of indices i such that α̂_i > 0. [See the Support Vector Classifier slide above.]
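
That the fit depends on the inputs only through pairwise inner products can be made concrete: the Gram matrix below contains everything the fitting algorithm needs (a sketch on simulated data):

    set.seed(4)
    n <- 6; p <- 2
    x <- matrix(rnorm(n * p), ncol = p)

    # n x n Gram matrix of inner products <x_i, x_i'> between all pairs of observations
    G <- x %*% t(x)
    all.equal(G[1, 2], sum(x[1, ] * x[2, ]))   # TRUE: entry (1, 2) is <x_1, x_2>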

Kernels and Support Vector Machines

If we can compute inner products between observations, we can fit a SV classifier. Can be quite abstract!

Some special kernel functions can do this for us. E.g.

K(x_i, x_i') = (1 + Σ_{j=1}^p x_{ij} x_{i'j})^d

computes the inner products needed for degree-d polynomials: (p+d choose d) basis functions! Try it for p = 2 and d = 2.

The solution has the form

f(x) = β_0 + Σ_{i∈S} α̂_i K(x, x_i).
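
Trying it for p = 2 and d = 2: the sketch below checks numerically that (1 + ⟨x, x'⟩)² equals an ordinary inner product in an enlarged space with (4 choose 2) = 6 basis functions; the particular feature map phi() is one convenient choice, written down only for illustration.

    # Feature map for the degree-2 polynomial kernel with p = 2 (one convenient scaling)
    phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                         x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

    set.seed(5)
    x <- rnorm(2); z <- rnorm(2)

    K_kernel  <- (1 + sum(x * z))^2      # kernel evaluation: stays in 2 dimensions
    K_feature <- sum(phi(x) * phi(z))    # explicit inner product in the 6-dimensional space
    all.equal(K_kernel, K_feature)       # TRUE, up to floating-point error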

Radial Kernel

K(x_i, x_i') = exp(−γ Σ_{j=1}^p (x_{ij} − x_{i'j})²)

f(x) = β_0 + Σ_{i∈S} α̂_i K(x, x_i)

[Figure: two-class data in (X_1, X_2) with the nonlinear decision boundary produced by the radial kernel.]

Implicit feature space; very high dimensional. Controls variance by squashing down most dimensions severely.
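
A sketch of fitting an SVM with the radial kernel via e1071, on simulated data with a circular class boundary (the gamma and cost values are illustrative; e1071's gamma is the γ in the formula above):

    library(e1071)

    set.seed(6)
    x <- matrix(rnorm(200 * 2), ncol = 2)
    y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))

    fit <- svm(x, y, kernel = "radial", gamma = 1, cost = 1)
    table(predicted = predict(fit, x), truth = y)   # training confusion matrix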

Example: Heart Data

[Figure: ROC curves on the Heart training data. Left panel: support vector classifier versus LDA. Right panel: support vector classifier versus SVMs with radial kernel, γ = 10⁻³, 10⁻², 10⁻¹.]

The ROC curve is obtained by changing the threshold 0 to threshold t in f̂(X) > t, and recording the false positive and true positive rates as t varies. Here we see ROC curves on training data.
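
That threshold-varying construction can be coded directly from the SVM's decision values f̂(x); a sketch on simulated data rather than the Heart data (the sign convention of the decision values follows the factor level ordering):

    library(e1071)

    set.seed(6)
    x <- matrix(rnorm(200 * 2), ncol = 2)
    y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "pos", "neg"), levels = c("pos", "neg"))
    fit <- svm(x, y, kernel = "radial", gamma = 1, cost = 1)

    # Decision values f-hat(x) for every training point
    scores <- drop(attr(predict(fit, x, decision.values = TRUE), "decision.values"))
    truth  <- y == "pos"

    roc_point <- function(t) c(fpr = mean(scores[!truth] > t), tpr = mean(scores[truth] > t))
    roc <- t(sapply(sort(unique(c(-Inf, scores, Inf))), roc_point))
    plot(roc[, "fpr"], roc[, "tpr"], type = "s",
         xlab = "False positive rate", ylab = "True positive rate")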

ROC curve

Example continued: Heart Test Data

[Figure: ROC curves on the Heart test data. Left panel: support vector classifier versus LDA. Right panel: support vector classifier versus SVMs with radial kernel, γ = 10⁻³, 10⁻², 10⁻¹.]

SVMs: more than 2 classes?

The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?

OVA (one versus all): fit K different 2-class SVM classifiers f̂_k(x), k = 1, ..., K, each class versus the rest. Classify x* to the class for which f̂_k(x*) is largest.

OVO (one versus one): fit all (K choose 2) pairwise classifiers f̂_{kl}(x). Classify x* to the class that wins the most pairwise competitions.

Which to choose? If K is not too large, use OVO.
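
e1071's svm() handles K > 2 classes itself using the one-versus-one scheme; one-versus-all can be sketched by hand, as below on the iris data that ships with R (the hand-rolled loop is illustrative only):

    library(e1071)

    # OVO is built in: svm() fits all pairwise classifiers and classifies by voting
    fit_ovo <- svm(Species ~ ., data = iris, kernel = "linear", cost = 1)

    # OVA by hand: one classifier per class, each class versus the rest
    classes <- levels(iris$Species)
    xmat <- as.matrix(iris[, 1:4])
    scores <- sapply(classes, function(k) {
      yk  <- factor(ifelse(iris$Species == k, k, "rest"), levels = c(k, "rest"))
      fit <- svm(xmat, yk, kernel = "linear", cost = 1)
      # by the level ordering above, larger decision values should favour class k
      drop(attr(predict(fit, xmat, decision.values = TRUE), "decision.values"))
    })
    pred_ova <- classes[max.col(scores)]   # class with the largest decision value
    mean(pred_ova == iris$Species)         # training accuracy of the hand-rolled OVA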

Support Vector versus Logistic Regression?

With f(X) = β_0 + β_1 X_1 + ... + β_p X_p, we can rephrase the support-vector classifier optimization as

minimize over β_0, β_1, ..., β_p:  Σ_{i=1}^n max[0, 1 − y_i f(x_i)] + λ Σ_{j=1}^p β_j²

[Figure: the SVM (hinge) loss and the logistic regression loss plotted against the margin y_i(β_0 + β_1 x_{i1} + ... + β_p x_{ip}).]

This has the form "loss plus penalty". The loss is known as the hinge loss. It is very similar to the loss in logistic regression (negative log-likelihood).
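
The two curves in the figure are easy to reproduce; a short sketch plotting the hinge loss against the logistic (negative log-likelihood) loss over a grid of margin values y·f(x):

    # Margin values y * f(x); negative means the point is misclassified
    m <- seq(-6, 2, length.out = 200)

    hinge    <- pmax(0, 1 - m)       # SVM hinge loss
    logistic <- log(1 + exp(-m))     # logistic regression loss, up to scaling

    matplot(m, cbind(hinge, logistic), type = "l", lty = 1,
            xlab = "y * f(x)", ylab = "Loss")
    legend("topright", legend = c("SVM (hinge) loss", "Logistic regression loss"),
           lty = 1, col = 1:2)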

Which to use: SVM or Logistic Regression?

When classes are (nearly) separable, SVM does better than LR. So does LDA. When they are not, LR (with ridge penalty) and SVM are very similar. If you wish to estimate probabilities, LR is the choice. For nonlinear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive.

Support vector regression
