Lecture 18: Multiclass Support Vector Machines

Fall, 2017

Outline
Overview of Multiclass Learning
Traditional Methods for Multiclass Problems: one-vs-rest approaches; pairwise approaches
Recent Developments for Multiclass Problems: simultaneous classification; various loss functions

Multiclass Classification Setup
Label: the binary label set {−1, +1} becomes {1, 2, ..., K}.
Classification decision rule: f : R^d → {1, 2, ..., K}.
Classification accuracy is measured by
Equal-cost: the generalization error (GE), Err(f) = P(Y ≠ f(X)).
Unequal-cost: the risk, R(f) = E_{Y,X} C(Y, f(X)), where C(y, k) is the cost of predicting class k when the true class is y.
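To make the two measures concrete, a small NumPy illustration; the labels and the cost matrix below are made up for the example.

```python
# Empirical versions of the two accuracy measures: the generalization error
# (equal costs) and the risk under a user-specified cost matrix C[y, k].
import numpy as np

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1, 0])

# Equal-cost: Err(f) = P(Y != f(X)), estimated by the misclassification rate.
err = np.mean(y_true != y_pred)

# Unequal-cost: R(f) = E C(Y, f(X)); C[y, k] is the (made-up) cost of
# predicting class k when the truth is y, zero on the diagonal.
C = np.array([[0, 1, 4],
              [1, 0, 2],
              [4, 2, 0]], dtype=float)
risk = np.mean(C[y_true, y_pred])
print("empirical GE:", err, "empirical risk:", risk)
```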

Traditional Methods
Main ideas:
(i) Decompose the multiclass classification problem into multiple binary classification problems.
(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.
Common approaches (simple but effective):
One-vs-rest (one-vs-all) approaches
Pairwise (one-vs-one, all-vs-all) approaches

One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
(i) Solve K different binary problems: classify class k versus the rest, for k = 1, ..., K.
(ii) Assign a test sample to the class with the largest (most positive) value f_k(x), where f_k(x) is the solution of the kth binary problem.
Properties:
Very simple to implement; performs well in practice.
Not optimal (asymptotically): the decision rule is not Fisher consistent if there is no dominating class (i.e. max_k p_k(x) < 1/2).
Read: Rifkin and Klautau (2004), "In Defense of One-vs-All Classification".
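A minimal one-vs-rest sketch, assuming scikit-learn's LinearSVC as the binary base learner (any binary SVM solver could be substituted); the data are synthetic.

```python
# One-vs-rest (OVA): train K binary SVMs, predict by the largest f_k(x).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
classes = np.unique(y)

# (i) Solve K binary problems: class k vs. the rest.
binary_models = []
for k in classes:
    clf = LinearSVC(C=1.0)
    clf.fit(X, np.where(y == k, 1, -1))   # +1 for class k, -1 for the rest
    binary_models.append(clf)

# (ii) Assign each point to the class with the largest (most positive) f_k(x).
scores = np.column_stack([m.decision_function(X) for m in binary_models])
y_pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(y_pred == y))
```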

Pairwise Approach
Also known as the all-vs-all (AVA) approach.
(i) Solve K(K − 1)/2 different binary problems: classify class k versus class j for all pairs j ≠ k; call each classifier g_kj.
(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.
Properties:
Training is efficient, since each binary problem is small.
If K is big, there are many problems to solve: for K = 10 we already need 45 binary classifiers.
Simple to implement; performs competitively in practice.
Read: Park and Fürnkranz (2007), "Efficient Pairwise Classification".
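A minimal pairwise (one-vs-one) sketch under the same assumptions, with each of the K(K − 1)/2 classifiers casting one vote; classes are assumed to be coded 0, ..., K − 1.

```python
# One-vs-one (pairwise): train a binary SVM per class pair, predict by voting.
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
classes = np.unique(y)                       # assumed to be 0, ..., K-1

pairwise_models = {}
for k, j in combinations(classes, 2):        # solve class k vs. class j
    mask = (y == k) | (y == j)
    clf = LinearSVC(C=1.0).fit(X[mask], np.where(y[mask] == k, 1, -1))
    pairwise_models[(k, j)] = clf

# Prediction: each classifier is queried once and issues a vote.
votes = np.zeros((X.shape[0], len(classes)))
for (k, j), clf in pairwise_models.items():
    pred = clf.predict(X)                    # +1 -> vote for k, -1 -> vote for j
    votes[:, k] += (pred == 1)
    votes[:, j] += (pred == -1)
y_pred = classes[np.argmax(votes, axis=1)]
print("training accuracy:", np.mean(y_pred == y))
```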

One Single SVM Approach: Simultaneous Classification
Label: {−1, +1} becomes {1, 2, ..., K}.
Use one single SVM to construct a decision function vector f = (f_1, ..., f_K).
Classifier (decision rule): f(x) = argmax_{k=1,...,K} f_k(x).
If K = 2, a single f_k suffices and the decision rule reduces to sign(f_k).
In some sense, multinomial logistic regression is a simultaneous classification procedure.

[Figure: the (x1, x2) plane is partitioned into three decision regions: Class 1 where f1 > max(f2, f3), Class 2 where f2 > max(f1, f3), and Class 3 where f3 > max(f1, f2).]

[Figure: each decision region is a polyhedron bounded by the hyperplanes f_j − f_k = 0, with margin boundaries at f_j − f_k = 1 (polyhedra one, two, and three).]

SVM for Multiclass Problems
Multiclass SVM: solve one single regularization problem by imposing a penalty on the differences f_y(x) − f_l(x), l ≠ y.
Weston and Watkins (1999); Crammer and Singer (2002); Lee et al. (2004); Liu and Shen (2006); multiclass ψ-learning: Shen et al. (2003).

Various Multiclass SVMs
Weston and Watkins (1999): a penalty is imposed only if f_y(x) < f_k(x) + 2 for some k ≠ y. Even if f_y(x) < 1, no penalty is paid as long as f_k(x) is sufficiently small for k ≠ y; similarly, if f_k(x) > 1 for some k ≠ y, no penalty is paid if f_y(x) is sufficiently large.
L(y, f(x)) = Σ_{k≠y} [2 − (f_y(x) − f_k(x))]_+.
Lee et al. (2004): L(y, f(x)) = Σ_{k≠y} [f_k(x) + 1]_+.
Crammer and Singer (2002); Liu and Shen (2006): L(y, f(x)) = [1 − min_{k≠y} {f_y(x) − f_k(x)}]_+.
To avoid redundancy, a sum-to-zero constraint Σ_{k=1}^K f_k = 0 is sometimes enforced.
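For concreteness, a small NumPy sketch of the three losses evaluated at a single observation; the function names and the example decision vector are illustrative only.

```python
# The three multiclass hinge-type losses above, for a decision vector
# f = (f_1, ..., f_K) and a label y given as a class index.
import numpy as np

def weston_watkins_loss(f, y):
    """Sum over k != y of [2 - (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return np.sum(np.maximum(0.0, 2.0 - diffs))

def lee_et_al_loss(f, y):
    """Sum over k != y of [f_k + 1]_+."""
    return np.sum(np.maximum(0.0, np.delete(f, y) + 1.0))

def crammer_singer_loss(f, y):
    """[1 - min_{k != y} (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return max(0.0, 1.0 - np.min(diffs))

f = np.array([0.8, -0.3, -0.5])   # example decision vector (sums to zero)
print(weston_watkins_loss(f, 0), lee_et_al_loss(f, 0), crammer_singer_loss(f, 0))
```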

Linear Multiclass SVMs
For linear classification problems, we have f_k(x) = β_k^T x + β_0k, k = 1, ..., K.
The sum-to-zero constraint can be replaced by Σ_{k=1}^K β_0k = 0 and Σ_{k=1}^K β_k = 0.
The optimization problem becomes
min_f Σ_{i=1}^n L(y_i, f(x_i)) + λ Σ_{k=1}^K ||β_k||^2
subject to the sum-to-zero constraints.
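One practical way to fit a single joint linear multiclass SVM is scikit-learn's LinearSVC, which implements the Crammer-Singer formulation (one of the simultaneous formulations listed above); a minimal sketch, with C playing the role of 1/λ.

```python
# Fit one joint linear multiclass SVM (Crammer-Singer formulation).
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
msvm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
msvm.fit(X, y)
print(msvm.coef_.shape)      # (K, d): one coefficient vector beta_k per class
print(msvm.predict(X[:5]))   # argmax_k f_k(x) decision rule
```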

Nonlinear Multiclass SVMs
To achieve nonlinear classification, we assume f_k(x) = β_k^T φ(x) + β_0k, k = 1, ..., K, where φ(x) represents the basis functions of the feature space F.
As in binary classification, the nonlinear MSVM can be conveniently solved using a kernel function.
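As a rough sketch of how the kernel enters: each component can be written as f_k(x) = Σ_i α_ik K(x_i, x) + β_0k, so only kernel evaluations are needed. The coefficients α and β_0 below are placeholders standing in for what a solver would return.

```python
# Evaluating a kernel expansion f_k(x) = sum_i alpha_ik K(x_i, x) + beta_0k
# with an RBF kernel; alpha and beta0 are hypothetical fitted coefficients.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(50, 4)), rng.normal(size=(5, 4))
K_classes = 3
alpha = rng.normal(size=(50, K_classes))   # placeholder coefficients
beta0 = np.zeros(K_classes)

f_test = rbf_kernel(X_test, X_train) @ alpha + beta0   # (5, K) decision values
print(f_test.argmax(axis=1))                            # argmax decision rule
```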

Regularization Problems for Nonlinear MSVMs
We can represent the MSVM as the solution to a regularization problem in a reproducing kernel Hilbert space (RKHS). Assume that f(x) = (f_1(x), ..., f_K(x)) ∈ ∏_{k=1}^K ({1} + H_k), under the sum-to-zero constraint.
Then an MSVM classifier can be derived by solving
min_f Σ_{i=1}^n L(y_i, f(x_i)) + λ Σ_{k=1}^K ||g_k||²_{H_k},
where f_k(x) = g_k(x) + β_0k, with g_k ∈ H_k and β_0k ∈ R.

Generalized Functional Margin
Given (x, y), a reasonable decision vector f(x) should encourage a large value of f_y(x) and small values of f_k(x), k ≠ y.
Define the (K − 1)-vector of relative differences as
g(f(x), y) = (f_y(x) − f_1(x), ..., f_y(x) − f_{y−1}(x), f_y(x) − f_{y+1}(x), ..., f_y(x) − f_K(x)).
Liu et al. (2004) called the vector g the generalized functional margin of f.
g characterizes the correctness and strength of the classification of x by f: f classifies (x, y) correctly if g(f(x), y) > 0_{K−1} (all components are positive).
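A tiny sketch of the functional margin and the resulting correctness check; the decision vector is the same illustrative example as above.

```python
# g(f, y) collects the differences f_y - f_k for k != y; classification of
# (x, y) is correct exactly when every component of g is positive.
import numpy as np

def functional_margin(f, y):
    """Return g(f, y) = (f_y - f_k) for all k != y."""
    return f[y] - np.delete(f, y)

f = np.array([0.8, -0.3, -0.5])
g = functional_margin(f, y=0)
print(g, "correct:", np.all(g > 0))   # [1.1 1.3] correct: True
```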

0-1 Loss with Functional Margin
A point (x, y) is misclassified if y ≠ argmax_k f_k(x).
Define the multivariate sign function, for u = (u_1, ..., u_m), as sign(u) = 1 if u_min = min(u_1, ..., u_m) > 0, and sign(u) = −1 if u_min ≤ 0.
Using the functional margin, the 0-1 loss becomes
I(min g(f(x), y) ≤ 0) = (1/2)[1 − sign(g(f(x), y))],
and the GE becomes
R[f] = (1/2) E[1 − sign(g(f(X), Y))].

Generalized Loss Functions Using the Functional Margin
A natural way to generalize the binary loss is Σ_{i=1}^n ℓ(min g(f(x_i), y_i)).
In particular, the loss function L(y, f(x)) can be expressed as V(g(f(x), y)) with
Weston and Watkins (1999): V(u) = Σ_{j=1}^{K−1} [2 − u_j]_+.
Lee et al. (2004): V(u) = Σ_{j=1}^{K−1} [(1/K) Σ_{c=1}^{K−1} u_c − u_j + 1]_+ (using the sum-to-zero constraint).
Liu and Shen (2006): V(u) = [1 − min_j u_j]_+.
All of these loss functions are upper bounds of the 0-1 loss.

Small Round Blue Cell Tumors of Childhood
Khan et al. (2001), Nature Medicine.
Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS).
Number of genes: 2308.
Class distribution of the data set:

Data set      EWS  BL(NHL)  NB  RMS  Total
Training set   23        8  12   20     63
Test set        6        3   6    5     20
Total          29       11  18   25     83

[Figure 3: Three principal components (PC1, PC2, PC3) of 100 gene expression levels (circles: training samples; squares: test samples, including non-SRBCT samples). The tumor types (EWS, BL, NB, RMS) are distinguished by colors.]

Characteristics of Support Vector Machines
High accuracy, high flexibility.
Naturally handle high-dimensional data.
Sparse representation of the solution (via support vectors): fast for making future predictions.
No probability estimates (hard classifiers).

Other Active Problems in SVM
Variable/feature selection:
Linear SVM: Bradley and Mangasarian (1998), Guyon et al. (2000), Rakotomamonjy (2003), Jebara and Jaakkola (2000).
Nonlinear SVM: Weston et al. (2002), Grandvalet (2003), basis pursuit (Zhang 2003), COSSO selection (Lin and Zhang 2003).
Proximal SVM: faster computation.
Robust SVM: get rid of outliers.
Choice of kernels.

The L1 SVM
Replace the L2 penalty with the L1 penalty. The L1 penalty tends to give sparse solutions.
For f(x) = h(x)^T β + β_0, the L1 SVM solves
min_{β_0, β} Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ Σ_{j=1}^d |β_j|.    (1)
The solution will have at most n nonzero coefficients β_j.
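A quick way to see the sparsity effect of the L1 penalty in practice; note that scikit-learn's LinearSVC only combines an L1 penalty with the squared hinge loss, so this is a close cousin of problem (1) rather than an exact solver for it, and the data are synthetic.

```python
# Compare the number of nonzero coefficients under L1 and L2 penalties.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
l2_svm = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=0.1)
l1_svm.fit(X, y); l2_svm.fit(X, y)
print("nonzero coefficients, L1:", np.sum(l1_svm.coef_ != 0))
print("nonzero coefficients, L2:", np.sum(l2_svm.coef_ != 0))
```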

L1 Penalty versus L2 Penalty
[Figure: the L1 penalty |w| and the L2 penalty w^2 plotted as J(w) for w in [−2, 2].]

Robust Support Vector Machines
The hinge loss is unbounded, hence sensitive to outliers (e.g., points with wrong labels).
Support vectors: points with y_i f(x_i) ≤ 1.
Truncated hinge loss: T_s(u) = H_1(u) − H_s(u), where H_s(u) = [s − u]_+.
Truncation removes some bad support vectors (Wu and Liu, 2006).
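A minimal sketch of the truncated hinge loss; the cap at 1 − s keeps any single outlier's influence bounded.

```python
# Truncated hinge loss T_s(u) = H_1(u) - H_s(u), with H_s(u) = [s - u]_+.
import numpy as np

def H(u, s):
    """Hinge-type function [s - u]_+."""
    return np.maximum(0.0, s - u)

def truncated_hinge(u, s=-1.0):
    """T_s(u) = H_1(u) - H_s(u): hinge loss capped at 1 - s."""
    return H(u, 1.0) - H(u, s)

u = np.linspace(-3, 3, 7)
print(truncated_hinge(u, s=-1.0))   # flat at 2 for u <= -1, zero for u >= 1
```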

Decomposition: Difference of Convex Functions
Key: the D.C. (difference of convex functions) decomposition T_s(u) = H_1(u) − H_s(u).
[Figure: three panels plotting H_1(u), H_s(u), and the truncated hinge loss T_s(u).]

D.C. Algorithm
The Difference Convex Algorithm for minimizing J(Θ) = J_vex(Θ) + J_cav(Θ):
1. Initialize Θ^0.
2. Repeat Θ^{t+1} = argmin_Θ ( J_vex(Θ) + ⟨∇J_cav(Θ^t), Θ − Θ^t⟩ ) until convergence of Θ^t.
The algorithm converges in finitely many steps (Liu et al. (2005)).
Choice of initial values: use the SVM solution.
RSVM: the set of support vectors is only a SUBSET of the original one!
Nonlinear learning can be achieved via the kernel trick.
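A toy sketch of the D.C. algorithm applied to the truncated-hinge objective with a single scalar coefficient (no intercept), so each convex subproblem can be handled by a bounded one-dimensional search; the data, bounds, and parameter choices are illustrative only.

```python
# DCA for min_beta sum_i T_s(y_i * beta * x_i) + lam * beta^2:
# convex part = hinge + ridge, concave part = -sum_i H_s(y_i * beta * x_i),
# which is linearized at the current iterate in each step.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = np.sign(x + 0.3 * rng.normal(size=40))
y[:3] = -y[:3]                       # flip a few labels to create outliers
lam, s = 0.1, -1.0

def J_vex(beta):                     # convex part: hinge loss + ridge penalty
    return np.sum(np.maximum(0.0, 1.0 - y * beta * x)) + lam * beta**2

def grad_J_cav(beta):                # (sub)gradient of -sum_i H_s(y_i beta x_i)
    u = y * beta * x
    return np.sum((u < s) * y * x)   # dH_s/du = -1 for u < s, 0 otherwise

beta = 1.0                           # initialize roughly at the SVM solution
for _ in range(20):
    g = grad_J_cav(beta)
    # minimize the convex surrogate J_vex(beta) + g * beta on a bounded range
    beta_new = minimize_scalar(lambda b: J_vex(b) + g * b,
                               bounds=(-10, 10), method="bounded").x
    if abs(beta_new - beta) < 1e-6:
        break
    beta = beta_new
print("robust beta:", beta)
```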