Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn

Support Vector Machines. Here we approach the two-class classification problem in a direct way: we try to find a plane that separates the classes in feature space. If we cannot, we get creative in two ways: we soften what we mean by "separates", and we enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane? A hyperplane in p dimensions is a flat affine subspace of dimension p − 1. In general the equation for a hyperplane has the form β0 + β1X1 + β2X2 + ... + βpXp = 0. In p = 2 dimensions a hyperplane is a line. If β0 = 0, the hyperplane goes through the origin, otherwise not. The vector β = (β1, β2, ..., βp) is called the normal vector; it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions. Figure: the hyperplane β1X1 + β2X2 − 6 = 0 in the (X1, X2) plane, with normal vector β = (β1, β2) = (0.8, 0.6), together with the parallel lines on which β1X1 + β2X2 − 6 equals −4 and 1.6.

A hyperplane divides p-dimensional space into two halves.
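As a tiny illustration of this (a sketch in R with hypothetical points, using the coefficients from the figure above), the sign of β0 + β1X1 + β2X2 tells us which half-space a point lies in:

# Which side of the hyperplane 0.8*X1 + 0.6*X2 - 6 = 0 does each point lie on?
beta0 <- -6; beta <- c(0.8, 0.6)             # coefficients from the figure above
pts <- rbind(c(10, 2), c(2, 2), c(5, 5))     # hypothetical points (x1, x2)
f <- beta0 + pts %*% beta                    # f > 0 on one side, f < 0 on the other
sign(f)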

A hyperplane in two-dimensional space

Classification Using a Separating Hyperplane

Separating Hyperplanes. Figure: two panels of two-class points in the (X1, X2) plane, each showing candidate separating hyperplanes. If f(X) = β0 + β1X1 + ... + βpXp, then f(X) > 0 for points on one side of the hyperplane, and f(X) < 0 for points on the other. If we code the colored points as Yi = +1 for blue, say, and Yi = −1 for mauve, then if Yi·f(xi) > 0 for all i, f(X) = 0 defines a separating hyperplane.

Maximal Margin Classifier. Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes. This is a constrained optimization problem:

maximize over β0, β1, ..., βp: M
subject to Σ_{j=1}^p βj² = 1 and yi(β0 + β1xi1 + ... + βp xip) ≥ M for all i = 1, ..., N.

This can be rephrased as a convex quadratic program, and solved efficiently. The function svm() in the R package e1071 solves this problem efficiently.
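As an illustration of the svm() function mentioned above, here is a minimal R sketch (simulated, assumed data, not from the slides) that approximates the maximal margin classifier by using a linear kernel with a very large cost, so that margin violations are effectively forbidden:

# Approximating the maximal margin classifier with e1071::svm():
# linear kernel, huge cost so that essentially no margin violations are tolerated.
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 2.5          # shift one class so the data are separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
summary(fit)        # reports the number of support vectors
fit$index           # indices of the support vectors (the points that define the margin)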

Maximal margin classifier

Constrained Optimization

Lagrange Multiplier

Quadratic Programming

Support Vectors

Non-separable Data. Figure: a two-class scatter of points in the (X1, X2) plane that cannot be separated by a linear boundary. This is often the case, unless N < p.

Noisy Data. Figure: two panels of points in the (X1, X2) plane. Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier. The support vector classifier maximizes a soft margin.

Support Vector Classifier. Figure: two panels of numbered observations in the (X1, X2) plane, with the soft margin drawn; the support vectors are the points on or inside the margin. The optimization problem is

maximize over β0, β1, ..., βp, ε1, ..., εn: M
subject to Σ_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + β2xi2 + ... + βp xip) ≥ M(1 − εi),
εi ≥ 0, Σ_{i=1}^n εi ≤ C.

C is a regularization parameter. Figure: four panels showing the support vector classifier fit to the same data in the (X1, X2) plane for different values of C.
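In the e1071 implementation, the tuning is done through the cost argument, which penalizes margin violations (so its effect runs opposite to the budget C above). A minimal sketch, reusing the assumed data frame dat from the earlier snippet, that selects cost by cross-validation with tune():

# e1071::tune() performs 10-fold cross-validation over a grid of tuning
# parameters; here we search over the cost of margin violations.
library(e1071)

set.seed(1)
cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))

summary(cv)            # cross-validation error for each cost value
best <- cv$best.model  # the support vector classifier refit at the best cost
summary(best)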

Support vector classifier. We can drop the norm constraint on β, define M = 1/‖β‖, and arrive at the equivalent form

min ‖β‖ subject to yi(β0 + xiᵀβ) ≥ 1 − ξi, ξi ≥ 0, Σ_{i=1}^n ξi ≤ constant.

This is the usual way the support vector classifier is defined for the non-separable case. Points well inside their class boundary do not play a big role in shaping the boundary.

Computing the Support Vector Classifier

The SVM as a Penalization Method. The standard SVM:

min over β0, β, ξ: (1/2)‖β‖² + C Σ_{i=1}^n ξi
subject to ξi ≥ 0, yi(β0 + xiᵀβ) ≥ 1 − ξi, i = 1, 2, ..., n,   (1)

where ξ = (ξ1, ξ2, ..., ξn) are the slack variables and C > 0 is a constant. The constraints of (1) can be written as ξi ≥ [1 − yi(β0 + xiᵀβ)]+, i = 1, 2, ..., n. Then the objective function satisfies

(1/2)‖β‖² + C Σ_{i=1}^n ξi ≥ (1/2)‖β‖² + C Σ_{i=1}^n [1 − yi(β0 + xiᵀβ)]+,

and the inequality holds with equality if and only if ξi = [1 − yi(β0 + xiᵀβ)]+.

The SVM as a Penalization Method. Therefore the solution (β0, β) of (1) is the same as the solution of the following optimization problem:

min over β0, β: (1/2)‖β‖² + C Σ_{i=1}^n [1 − yi(β0 + xiᵀβ)]+.

Let λ = 1/C. The above optimization problem is equivalent to

min over β0, β: Σ_{i=1}^n [1 − yi(β0 + xiᵀβ)]+ + (λ/2)‖β‖².   (2)

This has the form loss + penalty, which is a familiar paradigm in function estimation.

Implication of the Hinge Loss. Consider the conditional expectation of the loss:

E[ℓ(yf) | x] = ℓ(f) Pr(y = 1 | x) + ℓ(−f) Pr(y = −1 | x).

Minimizing this expected loss when ℓ(·) is either the hinge loss or the 0-1 loss gives the same minimizer:

f(x) = sign[ Pr(y = 1 | x) − 1/2 ].

The 0-1 loss is the true loss function in binary classification, but the corresponding empirical optimization problem

min over f: (1/n) Σ_{i=1}^n I[ yi f(xi) < 0 ]

is difficult. Since the minimizer of the expected loss is the same, the hinge loss is called a surrogate loss function. Compared with the perceptron loss, the hinge loss not only requires correct classification but is zero only when the classification is made with sufficiently high confidence; that is, the hinge loss places a stronger requirement on the learner.

Compare with other commonly used loss functions.

SVM hinge loss: [1 − yf(x)]+
Logistic regression loss: log[1 + e^{−yf(x)}]
Squared error: [y − f(x)]² = [1 − yf(x)]²
Huberized hinge loss:
ℓ(yf) = 0,                 yf > 1;
        (1 − yf)²/(2δ),    1 − δ < yf ≤ 1;
        1 − yf − δ/2,      yf ≤ 1 − δ,
where δ > 0 is a pre-specified constant.

Examination of the hinge loss function shows that it is reasonable for two-class classification, when compared to the logistic loss and squared error. The logistic loss has similar tails to the SVM loss, but there is never zero penalty for points well inside their margin. Squared error gives a quadratic penalty, and points well inside their own margin have a strong influence on the model as well. Rosset and Zhu (2007) proposed a Huberized hinge loss, which is very similar to the hinge loss in shape.
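To see the shapes being compared, here is a small R sketch (not from the slides) that plots the four loss functions above against the margin t = y·f(x), with δ = 2 for the huberized loss:

# Plot the four losses as functions of the margin t = y * f(x).
hinge     <- function(t) pmax(1 - t, 0)
logistic  <- function(t) log(1 + exp(-t))
squared   <- function(t) (1 - t)^2
huberized <- function(t, delta = 2) {
  ifelse(t > 1, 0,
         ifelse(t > 1 - delta, (1 - t)^2 / (2 * delta), 1 - t - delta / 2))
}

t <- seq(-3, 3, length.out = 200)
plot(t, hinge(t), type = "l", ylim = c(0, 8), xlab = "y f(x)", ylab = "loss")
lines(t, logistic(t), lty = 2)
lines(t, squared(t), lty = 3)
lines(t, huberized(t), lty = 4)
legend("topright", c("hinge", "logistic", "squared", "huberized (delta = 2)"), lty = 1:4)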

Hinge Loss and Huberized Loss. As δ decreases, the huberized loss gets closer to the hinge loss; in fact, as δ → 0, the limit of the huberized loss is the hinge loss. Unlike the hinge loss, the huberized loss function is differentiable everywhere. It is easy to show that the huberized loss satisfies

argmin_f E[ℓ(yf) | x] = 2 Pr(y = 1 | x) − 1 = E(y | x),

and E(y | x) > 0 if and only if sign[Pr(y = 1 | x) − 1/2] > 0. Rosset and Zhu (2007): among models fitted with the logistic loss, the huberized loss, and the squared hinge loss under an ℓ1 penalty, the huberized-loss model is the least affected by outliers.

ℓ1-SVM. The ‖β‖² penalty of the SVM is called the ridge penalty. It is well known that this shrinkage has the effect of controlling the variance of β̂, and hence possibly improves the fitted model's prediction accuracy, especially when there are many highly correlated features. However, the SVM uses all variables in classification, which can be a great disadvantage in high-dimensional classification. Applying the lasso penalty (ℓ1-norm) to the SVM instead, consider the optimization problem (the ℓ1-SVM):

min over β0, β: (1/n) Σ_{i=1}^n [1 − yi(β0 + xiᵀβ)]+ + λ‖β‖1.

The ℓ1-SVM is able to automatically discard irrelevant features. In the presence of many noise variables, the ℓ1-SVM has significant advantages over the standard SVM.

ℓ1-SVM. The optimization problem of the ℓ1-SVM is a linear program with many constraints, and efficient algorithms can be complex. The solution paths can have many jumps and discontinuities. For this reason, some authors prefer to replace the usual hinge loss with a smoother loss function, such as the squared hinge loss or the huberized hinge loss (Zhu, Rosset, Hastie, 2004). The number of variables selected by the ℓ1-norm penalty is upper bounded by the sample size n, and for highly correlated relevant variables the ℓ1-norm penalty tends to select only one or a few of them.

The hybrid huberized SVM (HHSVM). Li Wang, Ji Zhu and Hui Zou (2007) apply the elastic-net penalty to the SVM and propose the HHSVM:

min over (β0, β): Σ_{i=1}^n ℓ(yi(β0 + xiᵀβ)) + λ1‖β‖1 + (λ2/2)‖β‖2².   (3)

Here the loss function ℓ is the huberized hinge loss

ℓ(yf) = 0,                 yf > 1;
        (1 − yf)²/(2δ),    1 − δ < yf ≤ 1;
        1 − yf − δ/2,      yf ≤ 1 − δ,

and δ > 0 is a pre-specified constant. The default choice for δ is 2 unless specified otherwise.

The hybrid huberized SVM (HHSVM). The advantages of the elastic-net penalty:
1. The elastic-net penalty retains the variable selection feature of the ℓ1-norm penalty, and the number of selected variables is no longer bounded by n.
2. The elastic-net penalty tends to provide similar estimated coefficients for highly correlated variables.

Theorem. Let β̂0 and β̂ denote the solution of problem (3). For any pair (j, j'), we have

|β̂j − β̂j'| ≤ (1/λ2) ‖xj − xj'‖1 = (1/λ2) Σ_{i=1}^n |xij − xij'|.

If the input vectors xj and xj' are centered and normalized, then

|β̂j − β̂j'| ≤ (√n/λ2) ‖xj − xj'‖2 = (√n/λ2) √(2(1 − ρ)),

where ρ is the sample correlation between xj and xj'.

The hybrid huberized SVM (HHSVM): the GCD algorithm (Yi Yang and Hui Zou, 2014). Define

ri = yi(β̃0 + xiᵀβ̃),   p_{λ1,λ2}(βj) = λ1|βj| + (λ2/2)βj²,

F(βj | β̃0, β̃) = (1/n) Σ_{i=1}^n ℓ(ri + yi xij (βj − β̃j)) + p_{λ1,λ2}(βj).   (4)

Approximate the F function in (4) by a penalized quadratic function defined as

Q(βj | β̃0, β̃) = (1/n) Σ_{i=1}^n ℓ(ri) + (1/n) Σ_{i=1}^n ℓ'(ri) yi xij (βj − β̃j) + (1/δ)(βj − β̃j)² + p_{λ1,λ2}(βj).   (5)

The hybrid huberized SVM (HHSVM): the GCD algorithm (Yi Yang and Hui Zou, 2014). By a simple soft-thresholding rule, one can easily obtain the minimizer of (5):

β̂jC = argmin_{βj} Q(βj | β̃0, β̃) = S( (2/δ)β̃j − (1/n) Σ_{i=1}^n ℓ'(ri) yi xij, λ1 ) / (2/δ + λ2),   (6)

where S(z, t) = (|z| − t)+ sgn(z). Similarly, for the intercept consider minimizing the quadratic approximation

Q(β0 | β̃0, β̃) = (1/n) Σ_{i=1}^n ℓ(ri) + (1/n) Σ_{i=1}^n ℓ'(ri) yi (β0 − β̃0) + (1/δ)(β0 − β̃0)²,   (7)

which has the minimizer

β̂0C = argmin_{β0} Q(β0 | β̃0, β̃) = β̃0 − (δ/2) (1/n) Σ_{i=1}^n ℓ'(ri) yi.   (8)

The hybrid huberized SVM (HHSVM). Algorithm: the GCD algorithm for the HHSVM.
1. Initialize (β̃0, β̃).
2. Iterate (a)-(b) until convergence:
(a) Cyclic coordinate descent: for j = 1, 2, ..., p:
(a.1) Compute ri = yi(β̃0 + xiᵀβ̃).
(a.2) Compute β̂jC by (6).
(a.3) Set β̃j = β̂jC and move to the next j.
(b) Update the intercept term:
(b.1) Compute ri = yi(β̃0 + xiᵀβ̃).
(b.2) Compute β̂0C by (8).
(b.3) Set β̃0 = β̂0C.
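To make the updates (6)-(8) concrete, here is a rough R sketch of the loop above under stated assumptions (huberized hinge loss with δ = 2, labels in {−1, +1}, columns of x standardized so that the 1/δ majorization constant applies); it illustrates the update formulas and is not the gcdnet implementation:

# Rough sketch of the GCD updates (6)-(8) for the HHSVM. Illustrative only.
huber_deriv <- function(t, delta = 2) {       # derivative of the huberized hinge loss
  ifelse(t > 1, 0, ifelse(t > 1 - delta, -(1 - t) / delta, -1))
}
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)   # soft-thresholding S(z, t)

hhsvm_gcd <- function(x, y, lambda1, lambda2, delta = 2, maxit = 200) {
  n <- nrow(x); p <- ncol(x)
  b0 <- 0; b <- rep(0, p)
  for (it in 1:maxit) {
    b_old <- c(b0, b)
    for (j in 1:p) {                                   # (a) cyclic coordinate descent
      r <- y * (b0 + x %*% b)                          # (a.1) margins r_i
      g <- mean(huber_deriv(r, delta) * y * x[, j])    # (1/n) sum l'(r_i) y_i x_ij
      b[j] <- soft((2 / delta) * b[j] - g, lambda1) / (2 / delta + lambda2)   # (6)
    }
    r  <- y * (b0 + x %*% b)                           # (b) intercept update
    b0 <- b0 - (delta / 2) * mean(huber_deriv(r, delta) * y)                  # (8)
    if (max(abs(c(b0, b) - b_old)) < 1e-6) break       # crude convergence check
  }
  list(beta0 = b0, beta = b)
}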

Linear boundaries can fail. Figure: a two-class scatter in the (X1, X2) plane with an inherently nonlinear class boundary. Sometimes a linear boundary simply won't work, no matter what value of C we choose. The example on the left is such a case. What to do?

Feature Expansion. Enlarge the space of features by including transformations, e.g. X1², X1³, X1X2, X1X2², ... Hence go from a p-dimensional space to an M > p dimensional space. Fit a support-vector classifier in the enlarged space. This results in non-linear decision boundaries in the original space. Example: suppose we use (X1, X2, X1², X2², X1X2) instead of just (X1, X2). Then the decision boundary would be of the form

β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 = 0.

This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
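As a small R sketch (reusing the assumed simulated data frame dat from the first snippet), one can build the quadratic features by hand and fit a linear support vector classifier in the enlarged space:

# Hand-built quadratic feature expansion, then a linear support vector
# classifier in the enlarged 5-dimensional space.
library(e1071)

dat2 <- transform(dat,
                  x1sq = x1^2,
                  x2sq = x2^2,
                  x1x2 = x1 * x2)

fit_quad <- svm(y ~ x1 + x2 + x1sq + x2sq + x1x2,
                data = dat2, kernel = "linear", cost = 1)

# The linear boundary in the enlarged space corresponds to a quadratic
# (conic-section) boundary back in the original (x1, x2) space.
table(predicted = predict(fit_quad, dat2), truth = dat2$y)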

Cubic Polynomials. Here we use a basis expansion of cubic polynomials, going from 2 variables to 9. The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space. Figure: two panels in the (X1, X2) plane showing the resulting nonlinear decision boundary, which has the form

β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + β6X1³ + β7X2³ + β8X1X2² + β9X1²X2 = 0.

Nonlinearities and Kernels. Polynomials (especially high-dimensional ones) get wild rather fast. There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers through the use of kernels. Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors.

⟨xi, xi'⟩ = Σ_{j=1}^p xij xi'j    (inner product between vectors)

The linear support vector classifier can be represented as

f(x) = β0 + Σ_{i=1}^n αi ⟨x, xi⟩    (n parameters).

To estimate the parameters α1, ..., αn and β0, all we need are the n(n − 1)/2 inner products ⟨xi, xi'⟩ between all pairs of training observations. It turns out that most of the α̂i can be zero:

f(x) = β0 + Σ_{i∈S} α̂i ⟨x, xi⟩,

where S is the support set of indices i such that α̂i > 0 (see the Support Vector Classifier slide).

Kernels and Support Vector Machines. If we can compute inner products between observations, we can fit an SV classifier. This can be quite abstract! Some special kernel functions can do this for us. For example,

K(xi, xi') = (1 + Σ_{j=1}^p xij xi'j)^d

computes the inner products needed for d-dimensional polynomials: (p + d choose d) basis functions! Try it for p = 2 and d = 2 (a worked expansion follows below). The solution has the form

f(x) = β0 + Σ_{i∈S} α̂i K(x, xi).
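Working out the suggested case p = 2, d = 2 (a short check, not from the slides): expanding the square shows the kernel is an inner product of (4 choose 2) = 6 basis functions,

K(x, x') = (1 + x1 x'1 + x2 x'2)²
         = 1 + 2 x1 x'1 + 2 x2 x'2 + (x1 x'1)² + (x2 x'2)² + 2 x1 x'1 x2 x'2
         = ⟨h(x), h(x')⟩,  where h(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2).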

Radial Kernel.

K(xi, xi') = exp(−γ Σ_{j=1}^p (xij − xi'j)²),   f(x) = β0 + Σ_{i∈S} α̂i K(x, xi).

Figure: the resulting nonlinear decision boundary in the (X1, X2) plane. The implicit feature space is very high dimensional; the radial kernel controls variance by squashing down most dimensions severely.
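A minimal R sketch, again on assumed simulated data, fitting the radial kernel with e1071 and choosing gamma and cost by cross-validation:

# Radial-kernel SVM with gamma and cost chosen by 10-fold CV via e1071::tune().
library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- as.factor(ifelse(rowSums(x^2) > 2, 1, -1))   # a circular class boundary
dat_r <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

cv_r <- tune(svm, y ~ ., data = dat_r, kernel = "radial",
             ranges = list(cost = c(0.1, 1, 10, 100),
                           gamma = c(0.5, 1, 2, 4)))

plot(cv_r$best.model, dat_r)     # decision boundary of the best model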

Example: Heart Data. Figure: two ROC panels (true positive rate versus false positive rate); the left compares the support vector classifier with LDA, the right compares the support vector classifier with radial-kernel SVMs for γ = 10⁻³, 10⁻², 10⁻¹. The ROC curve is obtained by changing the threshold 0 to a threshold t in f̂(X) > t, and recording false positive and true positive rates as t varies. Here we see ROC curves on training data.
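One way to obtain such curves in R (a sketch, assuming the ROCR package is installed and reusing the assumed dat_r from the previous snippet) is to extract the fitted decision values f̂(x) from svm() and hand them to ROCR:

# ROC curve from an e1071 SVM fit; decision.values = TRUE makes predict()
# return the fitted f_hat(x) alongside the class labels.
library(e1071)
library(ROCR)

fit_roc <- svm(y ~ ., data = dat_r, kernel = "radial",
               gamma = 1, cost = 1, decision.values = TRUE)
fhat <- attributes(predict(fit_roc, dat_r, decision.values = TRUE))$decision.values

pred <- prediction(fhat, dat_r$y)      # note: fhat may need a sign flip,
perf <- performance(pred, "tpr", "fpr")  # depending on the factor level ordering
plot(perf)   # training-data ROC, as in the figure above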

ROC curve

SVMs: more than 2 classes? The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?

OVA, one versus all: fit K different 2-class SVM classifiers f̂k(x), k = 1, ..., K, each class versus the rest; classify x* to the class for which f̂k(x*) is largest.

OVO, one versus one: fit all (K choose 2) pairwise classifiers f̂kℓ(x); classify x* to the class that wins the most pairwise competitions.

Which to choose? If K is not too large, use OVO.
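For what it's worth, e1071::svm() (a libsvm wrapper) handles K > 2 classes automatically using the one-versus-one scheme; a tiny sketch on the built-in iris data (K = 3 classes):

# Multi-class classification via one-versus-one voting, done internally by svm().
library(e1071)

fit_multi <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(fit_multi, iris), truth = iris$Species)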

Support Vector versus Logistic Regression? With f(X) = β0 + β1X1 + ... + βpXp, we can rephrase the support-vector classifier optimization as

minimize over β0, β1, ..., βp: Σ_{i=1}^n max[0, 1 − yi f(xi)] + λ Σ_{j=1}^p βj².

Figure: the SVM loss and the logistic regression loss plotted against yi(β0 + β1xi1 + ... + βpxip). This has the form loss plus penalty. The loss is known as the hinge loss. It is very similar to the loss in logistic regression (negative log-likelihood).

Which to use: SVM or logistic regression? When the classes are (nearly) separable, the SVM does better than LR; so does LDA. When they are not, LR (with a ridge penalty) and the SVM are very similar. If you wish to estimate probabilities, LR is the choice. For nonlinear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive.

Support vector regression