JMLR: Workshop and Conference Proceedings 25:1-13, 2012    Asian Conference on Machine Learning

Max-Margin Ratio Machine

Suicheng Gu and Yuhong Guo
Department of Computer and Information Sciences
Temple University, Philadelphia, PA 19122, USA

Editors: Steven C.H. Hoi and Wray Buntine

Abstract

In this paper, we investigate the problem of exploiting global information to improve the performance of SVMs on large scale classification problems. We first present a unified general framework for the existing min-max machine methods in terms of within-class dispersions and between-class dispersions. By defining a new within-class dispersion measure, we then propose a novel max-margin ratio machine (MMRM) method that can be formulated as a linear programming problem and scales to large data sets. Kernels can easily be incorporated into our method to address non-linear classification problems. Our empirical results show that the proposed MMRM approach achieves promising results on large data sets.

1. Introduction

Support vector machines (SVMs) are among the most popular classification methods and have been widely used in machine learning and related fields. The standard SVM was derived from purely geometric principles (Vapnik, 1995): one attempts to find a consistent linear discriminant that maximizes the minimum Euclidean distance between the data points and the decision hyperplane. Although SVMs have demonstrated good performance in the literature, it has been noticed that the margins defined in SVMs are determined exclusively and locally by a small set of data points called support vectors, whereas all other data points are irrelevant to the decision hyperplane. The missing consideration of global statistical information in the data set can lead to an inferior decision hyperplane over the training data in some cases (Huang et al., 2008). Motivated by this observation, a new large margin approach called the maxi-min margin machine (M4) was developed in (Huang et al., 2004a, 2008). The M4 model takes both local and global information into consideration by incorporating the within-class variance into the standard SVM formulation. It builds a connection between the standard SVM and the minimax probability machine (MPM) (Lanckriet et al., 2002). The MPM model, focusing on global statistical information, maximizes the distance between the class means while minimizing the within-class variance. One extension of the MPM, the minimum error minimax probability machine (MEMPM), was proposed in (Huang et al., 2004b) and contains an explicit performance indicator. Another extension is the structured large margin machine (SLMM) proposed in (Yeung et al., 2007), which is sensitive to the data structures.

An earlier approach that is relevant in the context of exploring both within-class and between-class measures is linear discriminant analysis (LDA) (Duda and Hart, 1973), which minimizes the within-class dispersion and maximizes the between-class dispersion. However, these techniques developed to improve (or with the potential to improve) the standard SVM suffer from an evident drawback: a lack of scalability. They all require estimating covariance matrices from the data, which is very sensitive to the particular data sample, e.g., to outliers, especially in small sample size problems. Moreover, the MPM is solved via a second order cone program (SOCP), and the M4 and SLMM require solving sequential SOCPs. For high dimensional and large scale data sets, these approaches therefore suffer from inherent computational limitations.

More recently, a relative margin machine (RMM) method was introduced to overcome the sensitivity of SVMs to directions of large data spread (Shivaswamy and Jebara, 2008, 2010). The RMM maximizes the margin relative to the spread of the data by bounding the projections of the training examples within a region of radius at most a threshold value B. It was shown that a properly chosen B can improve prediction accuracy. However, the model is very sensitive to the value of B, and B is not easy to tune. When B is too large, the bounding constraints are inactive and the solution is the same as that of the SVM. When B is too small, all training samples are bounded by B and the large margin objective receives too little weight.

In this paper, we first show that many of the models mentioned above, namely M4, MPM, LDA, SVMs and RMM, can be expressed in a unified general min-max machine framework in terms of within-class dispersion and between-class dispersion measures. We then propose a novel max-margin ratio machine (MMRM) within the min-max machine framework, based on a new within-class dispersion definition. The MMRM can be reformulated as an equivalent linear programming problem and thus scales to large data sets. Kernels can be incorporated into the MMRM to cope with non-linear classification problems. Our empirical results show that the proposed MMRM approach can achieve higher classification accuracies on large data sets compared to the standard SVM and the RMM method.

2. Min-Max Machine Framework

In this section, we present a min-max machine framework for binary classification problems. First we define some notation used in this section and throughout the paper. We use $X = \{x_i\}_{i=1}^{n_x}$ to denote the positive samples and $Y = \{y_j\}_{j=1}^{n_y}$ to denote the negative samples, where $x_i, y_j \in \mathbb{R}^m$ are the positive and negative sample vectors, $n_x$ and $n_y$ denote the numbers of positive and negative samples respectively, and $n = n_x + n_y$ denotes the size of the training set. We use $(\bar{x}, \Sigma_x)$ and $(\bar{y}, \Sigma_y)$ to denote the mean vectors and covariance matrices of the two classes respectively, such that

$$\bar{x} = \frac{1}{n_x}\sum_{i=1}^{n_x} x_i, \quad \bar{y} = \frac{1}{n_y}\sum_{j=1}^{n_y} y_j, \quad \Sigma_x = \sum_{i=1}^{n_x}(x_i - \bar{x})(x_i - \bar{x})^\top, \quad \Sigma_y = \sum_{j=1}^{n_y}(y_j - \bar{y})(y_j - \bar{y})^\top.$$

We consider the problem of training a good linear classifier. The standard SVM identifies the optimal linear classifier $f(x) = \mathrm{sign}(w^\top x + b)$ as the one that maximizes the margin between the two classes, which is equivalent to maximizing the minimum projected between-class distance for linearly separable data. However, it has been noticed that SVMs sometimes miss the optimal solution because they ignore global statistical information (Huang et al., 2008). The M4, MPM and RMM methods were proposed to tackle this drawback of SVMs by considering both between-class and within-class dispersions, or by considering both the margin and a bound on the projections of all training instances.

Here we define a min-max machine framework that generalizes all these models. A min-max machine aims to determine a hyperplane $H(w; b) = \{x \mid w^\top x = b\}$, where $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$, that separates the two classes of data points by minimizing the within-class dispersion and maximizing the between-class dispersion. More specifically, we can define the min-max machine as the general optimization problem

$$\min_{w}\ \frac{d_w(w)}{d_b(w)} \qquad (1)$$

where $d_w(w)$ and $d_b(w)$ denote the within-class dispersion and the between-class dispersion respectively. By replacing $d_w(w)$ and $d_b(w)$ with different specific dispersion measures, one obtains a family of min-max machine models. Assuming linearly separable data, we show below that LDA, MPM, SVM, M4 and RMM can all be formulated within this min-max machine framework.

2.1. Linear Discriminant Analysis (LDA)

LDA is a well known discriminative method whose optimization goal is to minimize the within-class dispersion and maximize the between-class dispersion, which is consistent with the min-max machine framework we propose. LDA is typically defined as $\max_w \frac{w^\top \Sigma_b w}{w^\top \Sigma_w w}$, where $\Sigma_b = (\bar{x} - \bar{y})(\bar{x} - \bar{y})^\top$ and $\Sigma_w = \Sigma_x + \Sigma_y$. Thus the within-class dispersion for LDA is the square root of the sum of the projected within-class variances of the two classes

$$d^v_w(w) = \sqrt{w^\top \Sigma_x w + w^\top \Sigma_y w} \qquad (2)$$

and the between-class dispersion of LDA is the projected distance between the two mean (center) vectors of the two classes

$$d^c_b(w) = \sqrt{w^\top(\bar{x} - \bar{y})(\bar{x} - \bar{y})^\top w} = |w^\top(\bar{x} - \bar{y})| \qquad (3)$$

2.2. Minimax Probability Machine (MPM)

The optimization problem for the MPM is formulated as

$$\min_{w}\ \sqrt{w^\top \Sigma_x w} + \sqrt{w^\top \Sigma_y w} \quad \text{s.t.} \quad w^\top(\bar{x} - \bar{y}) = 1 \qquad (4)$$

which is solved as a second order cone program (SOCP) (Lanckriet et al., 2002). One can easily verify that this problem is equivalent to a min-max machine problem with the between-class dispersion $d^c_b(w)$ defined in (3) and the within-class dispersion defined as the sum of the projected standard deviations of the two classes

$$d^s_w(w) = d^s_x(w) + d^s_y(w) = \sqrt{w^\top \Sigma_x w} + \sqrt{w^\top \Sigma_y w} \qquad (5)$$
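As a concrete illustration (not part of the original paper), the projected dispersion measures in (2), (3) and (5) can be computed directly with a few lines of numpy for a given direction; the function and array names below are hypothetical.

    import numpy as np

    def dispersions(X, Y, w):
        # X: (n_x, m) positive samples; Y: (n_y, m) negative samples; w: (m,) direction.
        x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
        Sx = (X - x_bar).T @ (X - x_bar)   # unnormalized class covariance Sigma_x
        Sy = (Y - y_bar).T @ (Y - y_bar)   # unnormalized class covariance Sigma_y
        d_v = np.sqrt(w @ Sx @ w + w @ Sy @ w)           # LDA within-class dispersion, eq. (2)
        d_c = abs(w @ (x_bar - y_bar))                   # between-class center distance, eq. (3)
        d_s = np.sqrt(w @ Sx @ w) + np.sqrt(w @ Sy @ w)  # MPM within-class dispersion, eq. (5)
        return d_v, d_c, d_s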

2.3. Support Vector Machines (SVMs)

The hyperplane of the standard SVM can be viewed as being determined by minimizing a simple within-class dispersion in terms of the norm of $w$ (note that a smaller norm of $w$ results in a smaller within-class dispersion)

$$d^n_w(w) = \left(\tfrac{1}{2}\, w^\top w\right)^{\frac{1}{2}} \qquad (6)$$

and maximizing the between-class margin distance defined between the samples closest to the margin (the support vectors)

$$d^m_b(w) = w^\top(x^* - y^*), \qquad (7)$$

where $x^*$ and $y^*$ are the margin samples of classes $X$ and $Y$ respectively, such that

$$x^* = \arg\min_{x_i}\ w^\top x_i \qquad (8)$$
$$y^* = \arg\max_{y_j}\ w^\top y_j \qquad (9)$$

Applying the within-class dispersion and the between-class dispersion defined in (6) and (7), the min-max machine problem in (1) yields the standard linearly separable SVM

$$\min_{w,b}\ \tfrac{1}{2}\, w^\top w \quad \text{subject to} \quad w^\top x_i + b \ge 1,\ \forall i; \quad -(w^\top y_j + b) \ge 1,\ \forall j, \qquad (10)$$

where the bias term is

$$b = b_1 = -\frac{w^\top x^* + w^\top y^*}{2}. \qquad (11)$$

2.4. Maxi-Min Margin Machine (M4)

The M4 was proposed in an attempt to integrate the SVM and the MPM by considering both discriminative margin information and global statistical information. It is formulated as the optimization problem

$$\max_{\rho, w, b}\ \rho \qquad (12)$$
$$\text{subject to} \quad w^\top x_i + b \ge \rho\sqrt{w^\top \Sigma_x w},\ \forall i; \quad -(w^\top y_j + b) \ge \rho\sqrt{w^\top \Sigma_y w},\ \forall j$$

which can be put into the min-max machine framework by using the within-class dispersion $d^s_w(w)$ defined in (5) and the between-class margin distance $d^m_b(w)$ defined in (7). The bias term $b$ for the M4 is computed via

$$b = b_2 = -w^\top y^* - \frac{d^s_y(w)}{d^s_w(w)}\, w^\top(x^* - y^*).$$

[Figure 1: The mapping of various samples in the $w$ space. $\bar{x}, \bar{y}$: means of the positive and negative classes; $x^*, y^*$: margin samples (support vectors); $b_1, b_2$: bias terms; $x_i$: an outlier sample. The projection axis is annotated with the values $w^\top\bar{y}$, $w^\top y^*$, $b_1$, $b_2$, $w^\top x^*$, $w^\top\bar{x}$, $w^\top x_i$ and the distances $d^m_y$, $d^m_b$, $d^m_x$.]

2.5. Relative Margin Machines (RMM)

The RMM maximizes the classification margin relative to the spread of the data. It is formulated as

$$\min_{w,b}\ \tfrac{1}{2}\, w^\top w \qquad (13)$$
$$\text{subject to} \quad w^\top x_i + b \ge 1,\ \ w^\top x_i + b \le B,\ \forall i; \quad -(w^\top y_j + b) \ge 1,\ \ -(w^\top y_j + b) \le B,\ \forall j.$$

It is easy to verify that the above problem is equivalent to a min-max machine problem where the within-class dispersion is defined as the regularized projected diameter of the whole data set

$$d^d_w(w) = \max_{i,j}\ w^\top(x_i - y_j) + \frac{D}{2}\, w^\top w \qquad (14)$$

and the between-class margin distance $d^m_b(w)$ is defined in (7). The $D$ in (14) is a trade-off parameter.

The techniques presented above suggest that combining different within-class and between-class dispersions (projected onto the $w$ space) within the min-max machine framework can lead to models with strengths in different respects. Figure 1 provides an intuitive picture of outlier samples, class mean vectors, margin samples (support vectors) and the bias terms in the projected space.

3. Max-Margin Ratio Machine

Except for support vector machines and relative margin machines, all the other methods reviewed within the min-max machine framework above require the estimation of covariance matrices, which can make their computation inefficient for high dimensional or large data sets. Moreover, covariance matrices are typically not very robust to outlier samples. Motivated by these observations, in this section we propose a new within-class margin dispersion in the min-max machine framework, which leads to a novel classification method, the max-margin ratio machine.

3.1. Max-Margin Ratio Machine Model

Consider the within-class dispersion $d^s_w(w)$ defined in (5) for the MPM. Expanding the covariance terms, it can be rewritten as

$$d^s_w(w) = d^s_x(w) + d^s_y(w) = \left(\sum_{i=1}^{n_x}(w^\top x_i - w^\top\bar{x})^2\right)^{\frac{1}{2}} + \left(\sum_{j=1}^{n_y}(w^\top y_j - w^\top\bar{y})^2\right)^{\frac{1}{2}} \qquad (15)$$

Now we replace the L2 (Frobenius) norm in (15) with the $L_\infty$ norm, obtaining a new within-class dispersion measure

$$\lim_{d\to\infty}\ \left(\sum_i |w^\top\bar{x} - w^\top x_i|^d\right)^{\frac{1}{d}} + \left(\sum_j |w^\top y_j - w^\top\bar{y}|^d\right)^{\frac{1}{d}} = \max_i |w^\top\bar{x} - w^\top x_i| + \max_j |w^\top y_j - w^\top\bar{y}| \qquad (16)$$

in which covariance terms are avoided. For linearly separable data, samples can only appear on the correct side of the separating hyperplane, and outliers will be far away from it. Thus, to reduce the influence of possible outliers, we consider only samples $x_i$ and $y_j$ that satisfy $w^\top\bar{x} - w^\top x_i \ge 0$ and $w^\top y_j - w^\top\bar{y} \ge 0$. From Figure 1 we can see that these are the samples lying between each class center and the separating hyperplane. With this additional restriction, the absolute value signs in (16) can be dropped, and a novel within-class dispersion measure is obtained

$$d^m_w(w) = \max_i\ (w^\top\bar{x} - w^\top x_i) + \max_j\ (w^\top y_j - w^\top\bar{y}) = (w^\top\bar{x} - w^\top x^*) + (w^\top y^* - w^\top\bar{y}) \qquad (17)$$

which is the sum of the distances between the margin samples, $x^*$ and $y^*$, and the class mean vectors. This new within-class dispersion does not require data covariance terms, so the computational cost can be reduced. Outliers that are far away from the hyperplane (see $x_i$ in Figure 1), which enter only through the estimation of the mean vectors, have very little influence on this dispersion measure. The new within-class dispersion measure also builds a connection between the between-class center distance in (3) and the between-class margin distance in (7):

$$d^m_w(w) = d^c_b(w) - d^m_b(w). \qquad (18)$$

Combining this new within-class dispersion (17) with the between-class margin distance $d^m_b(w)$ of (7) used in SVMs, in a similar way to the M4 within the min-max machine framework, we obtain the following optimization problem

$$\max_{\rho, w, b}\ \rho \qquad (19)$$
$$\text{subject to} \quad w^\top x_i + b \ge \rho\,(w^\top\bar{x} - w^\top x_i),\ \forall i; \quad -(w^\top y_j + b) \ge \rho\,(w^\top y_j - w^\top\bar{y}),\ \forall j \qquad (20)$$

where $\rho$ represents the ratio between the between-class dispersion $d^m_b(w)$ and the within-class dispersion $d^m_w(w)$, as formalized in Theorem 1 below.
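As an aside (not from the paper), the dispersion measures in (7) and (17) and the resulting margin ratio can be evaluated for a fixed direction with a short numpy sketch; the names below are hypothetical and linearly separable data is assumed.

    import numpy as np

    def margin_ratio(X, Y, w):
        # Assumes w separates the positive samples X from the negative samples Y.
        x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
        px, py = X @ w, Y @ w                  # projections of the two classes
        d_b_m = px.min() - py.max()            # between-class margin distance, eq. (7)
        d_w_m = (x_bar @ w - px.min()) + (py.max() - y_bar @ w)  # within-class dispersion, eq. (17)
        return d_b_m / d_w_m                   # best achievable ratio for this w, cf. eq. (21)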

Theorem 1  For linearly separable data, the following equation holds for the optimal solution $(\rho^*, w^*)$ of the optimization problem (19), the within-class dispersion $d^m_w(w^*)$ and the between-class margin distance $d^m_b(w^*)$:

$$\rho^* = \frac{d^m_b(w^*)}{d^m_w(w^*)} \qquad (21)$$

Proof: According to the constraints of (19), when $w$ is fixed, the samples $\{x_i\}$ and $\{y_j\}$ have no effect on $\rho$ if $w^\top\bar{x} - w^\top x_i \le 0$ or $w^\top y_j - w^\top\bar{y} \le 0$. The optimal $\rho$ corresponding to $w$ is

$$\rho = \max_b\ \min_{i,j}\left\{\frac{w^\top x_i + b}{w^\top\bar{x} - w^\top x_i},\ \frac{-(w^\top y_j + b)}{w^\top y_j - w^\top\bar{y}}\ \Big|\ w^\top\bar{x} - w^\top x_i > 0,\ w^\top y_j - w^\top\bar{y} > 0\right\}$$
$$= \max_b\ \min\left\{\frac{w^\top x^* + b}{w^\top\bar{x} - w^\top x^*},\ \frac{-(w^\top y^* + b)}{w^\top y^* - w^\top\bar{y}}\right\} \le \max_b\ \frac{(w^\top x^* + b) - (w^\top y^* + b)}{(w^\top\bar{x} - w^\top x^*) + (w^\top y^* - w^\top\bar{y})} = \frac{d^m_b(w)}{d^m_w(w)}$$

Here we use the fact that if $b, d > 0$ then $\min\left(\frac{a}{b}, \frac{c}{d}\right) \le \frac{a+c}{b+d}$. Equality holds if and only if

$$\frac{w^\top x^* + b}{w^\top\bar{x} - w^\top x^*} = \frac{-(w^\top y^* + b)}{w^\top y^* - w^\top\bar{y}} \qquad (22)$$

which implies

$$b = -w^\top y^* - \frac{w^\top y^* - w^\top\bar{y}}{(w^\top\bar{x} - w^\top x^*) + (w^\top y^* - w^\top\bar{y})}\ w^\top(x^* - y^*). \qquad (23)$$

According to this interpretation, we name the model proposed in (19) the max-margin ratio machine. Note that the optimization problem in (19) is not convex, due to the bilinear term $\rho\, w$ in the constraints. Although sequential quadratic programming could be used to solve this problem, it would not be an efficient solution.

3.2. An Equivalent Alternative Formulation

In order to obtain a simpler optimization problem, we next integrate the proposed within-class dispersion $d^m_w(w)$ in (17) with the much simpler between-class dispersion $d^c_b(w)$ defined in (3). Within the min-max machine framework, this optimization problem can be literally formulated as

$$\max_{w}\ w^\top(\bar{x} - \bar{y}) \qquad (24)$$
$$\text{subject to} \quad (w^\top\bar{x} - w^\top x_i) + (w^\top y_j - w^\top\bar{y}) \le 1 \quad \forall i, j.$$

More interestingly, we show below that the two optimization problems (19) and (24) yield the same optimal solution.

Lemma 2  Assuming linearly separable training data, the optimization problems (19) and (24) have the same optimal solution with respect to $w$.

Proof: We know $d^m_w(w) = d^c_b(w) - d^m_b(w)$ (see (18)). If $w^*$ is the optimal solution of (19), then

$$\rho(w^*) = \frac{d^m_b(w^*)}{d^m_w(w^*)} = \max_w\ \frac{d^m_b(w)}{d^m_w(w)}.$$

The optimization problem (24) is

$$\max_w\ \frac{d^c_b(w)}{d^m_w(w)} = \max_w\ \frac{d^m_b(w) + d^m_w(w)}{d^m_w(w)} = \max_w\ \frac{d^m_b(w)}{d^m_w(w)} + 1 = \rho(w^*) + 1.$$

Therefore, $w^*$ is also the optimal solution of (24).

Lemma 2 states that the optimization problems (19) and (24) are equivalent for linearly separable data. Below we further exploit the relationship between $d^m_w(w)$, $d^c_b(w)$ and $d^m_b(w)$ to construct an even simpler formulation. Consider the following identities

$$\arg\max_w\ \frac{d^m_b(w)}{d^m_w(w)} = \arg\min_w\ \frac{d^m_w(w)}{d^m_b(w)} + 1 = \arg\min_w\ \frac{d^m_w(w) + d^m_b(w)}{d^m_b(w)} = \arg\min_w\ \frac{d^c_b(w)}{d^m_b(w)}.$$

This suggests the following simple linear programming problem

$$\min_{w,b}\ w^\top(\bar{x} - \bar{y}) \quad \text{subject to} \quad w^\top x_i + b \ge 1,\ \forall i; \quad -(w^\top y_j + b) \ge 1,\ \forall j. \qquad (25)$$
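For concreteness, problem (25) can be handed to any off-the-shelf LP solver. The sketch below (not from the paper, which reports using SeDuMi) formulates it for scipy.optimize.linprog; X and Y are hypothetical arrays of positive and negative samples, assumed linearly separable.

    import numpy as np
    from scipy.optimize import linprog

    def mmrm_hard_margin(X, Y):
        # Variables z = [w (m entries), b]; minimize (x_bar - y_bar)^T w subject to
        # w^T x_i + b >= 1 and -(w^T y_j + b) >= 1, written here as A_ub z <= b_ub.
        n_x, m = X.shape
        n_y = Y.shape[0]
        c = np.append(X.mean(axis=0) - Y.mean(axis=0), 0.0)
        A_ub = np.vstack([np.hstack([-X, -np.ones((n_x, 1))]),
                          np.hstack([ Y,  np.ones((n_y, 1))])])
        b_ub = -np.ones(n_x + n_y)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (m + 1))
        return res.x[:m], res.x[m]   # (w, b)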

Theorem 3  If $w^*$ is the optimal solution of (25), then $w^*$ is the optimal solution of (19), and the optimal margin ratio of (19) is $\rho^* = \frac{2}{w^{*\top}(\bar{x} - \bar{y}) - 2}$.

Proof: Assume $w^*$ is an optimal solution of (25). Let $x^*$ and $y^*$ be the margin samples of the two classes along the direction $w^*$, as defined in (8) and (9). Then it is straightforward to show that $w^{*\top}(x^* - y^*) = 2$. Let $b_2$ be defined as in (23); then $(w^*,\ \rho^* = \frac{2}{w^{*\top}(\bar{x} - \bar{y}) - 2},\ b_2)$ is a feasible solution of (19), according to the proof of Theorem 1. If $w^*$ is not an optimal solution of (19), there exists another feasible solution of (19), $(w_1, \rho_1)$, satisfying $\rho_1 > \frac{2}{w^{*\top}(\bar{x} - \bar{y}) - 2}$. Let

$$x^1 = \arg\min_{x_i}\ w_1^\top x_i \qquad (26)$$
$$y^1 = \arg\max_{y_j}\ w_1^\top y_j \qquad (27)$$

Assume $w_1^\top(x^1 - y^1) = 2s$. Let $w_2 = \frac{1}{s} w_1$; then $(w_2, \rho_1)$ is a feasible solution of (19) as well. Moreover, $w_2^\top(x^1 - y^1) = 2$, and $w_2$ is a feasible solution of (25), since there exists a value of $b$ such that $w_2^\top x_i + b \ge w_2^\top x^1 + b = 1,\ \forall i$ and $-(w_2^\top y_j + b) \ge -(w_2^\top y^1 + b) = 1,\ \forall j$. Thus we have

$$w_2^\top(\bar{x} - \bar{y}) \le \frac{2}{\rho_1} + 2 < \frac{2}{\rho^*} + 2 = w^{*\top}(\bar{x} - \bar{y}) \qquad (28)$$

This inequality means $w^*$ is not an optimal solution of (25), which contradicts the assumption.

The theorem above shows that one can solve the max-margin ratio machine problem (19) and the problem (24) by solving the simple linear programming problem (25). Below we show that the convex linear program (25) is a bounded optimization problem.

Lemma 4  If $w$ satisfies the constraints in (25), then $w^\top(\bar{x} - \bar{y}) \ge 2$.

Proof: Averaging the first group of constraint inequalities in (25) over the positive class gives $w^\top\bar{x} + b \ge 1$, and averaging the second group over the negative class gives $-(w^\top\bar{y} + b) \ge 1$; adding the two yields the conclusion.

3.3. Slack Variables for Non-separable Data

All derivations above are for linearly separable data. For non-separable problems, we can add slack variables to cope with misclassification errors, which yields the following soft margin model

$$\min_{w,b,\xi,\eta}\ w^\top(\bar{x} - \bar{y}) + C\left(\sum_{i=1}^{n_x}\xi_i + \sum_{j=1}^{n_y}\eta_j\right) \qquad (29)$$
$$\text{subject to} \quad w^\top x_i + b \ge 1 - \xi_i,\ \xi_i \ge 0,\ \forall i; \quad -(w^\top y_j + b) \ge 1 - \eta_j,\ \eta_j \ge 0,\ \forall j.$$

From now on, we refer to this model as the max-margin ratio machine (MMRM). This is again a standard linear programming problem, which can be solved easily. Note that we can let $C \ge \max\left(\frac{1}{n_x}, \frac{1}{n_y}\right)$ to ensure that the problem has a bounded solution (according to the proof of Lemma 4).
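The soft-margin problem (29) is again a linear program; extending the earlier sketch, one possible scipy.optimize.linprog formulation is shown below (an illustration only, with hypothetical array names; it is not the solver the paper used).

    import numpy as np
    from scipy.optimize import linprog

    def mmrm_soft_margin(X, Y, C=1.0):
        # Variables z = [w (m), b (1), xi (n_x), eta (n_y)], following problem (29);
        # Section 3.3 suggests C >= max(1/n_x, 1/n_y) for a bounded solution.
        n_x, m = X.shape
        n_y = Y.shape[0]
        c = np.concatenate([X.mean(axis=0) - Y.mean(axis=0), [0.0], C * np.ones(n_x + n_y)])
        # -(w^T x_i + b) - xi_i <= -1  and  (w^T y_j + b) - eta_j <= -1
        A_pos = np.hstack([-X, -np.ones((n_x, 1)), -np.eye(n_x), np.zeros((n_x, n_y))])
        A_neg = np.hstack([ Y,  np.ones((n_y, 1)), np.zeros((n_y, n_x)), -np.eye(n_y)])
        b_ub = -np.ones(n_x + n_y)
        bounds = [(None, None)] * (m + 1) + [(0, None)] * (n_x + n_y)
        res = linprog(c, A_ub=np.vstack([A_pos, A_neg]), b_ub=b_ub, bounds=bounds)
        return res.x[:m], res.x[m]   # (w, b)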

3.4. Kernelization of MMRM

We have focused on linear classification problems so far. However, it is well known that many linearly non-separable problems are separable in a nonlinear high dimensional space. In order to deal with nonlinear classification problems, we apply the standard kernelization technique to the MMRM. This is done by introducing a nonlinear mapping function $\varphi: \mathbb{R}^m \to \mathbb{R}^f$ that maps the original samples into a high dimensional feature space $\mathbb{R}^f$, and rewriting the parameter $w$ as

$$w = \sum_{k=1}^{n_x}\alpha_k\,\varphi(x_k) - \sum_{l=1}^{n_y}\beta_l\,\varphi(y_l) \qquad (30)$$

with nonnegative parameters $\{\alpha_k\}$ and $\{\beta_l\}$. The kernelized MMRM is then obtained by replacing the $w$ in (29) with (30) and using a kernel function $K(\cdot,\cdot)$ in place of the inner product of two high dimensional vectors, $\varphi(\cdot)^\top\varphi(\cdot)$:

$$\min_{\alpha,\beta,b,\xi,\eta}\ \sum_{k=1}^{n_x}\alpha_k\left(\frac{1}{n_x}\sum_{i=1}^{n_x}K(x_k, x_i) - \frac{1}{n_y}\sum_{j=1}^{n_y}K(x_k, y_j)\right) - \sum_{l=1}^{n_y}\beta_l\left(\frac{1}{n_x}\sum_{i=1}^{n_x}K(y_l, x_i) - \frac{1}{n_y}\sum_{j=1}^{n_y}K(y_l, y_j)\right) + C\left(\sum_{i=1}^{n_x}\xi_i + \sum_{j=1}^{n_y}\eta_j\right) \qquad (31)$$

$$\text{subject to} \quad \sum_{k=1}^{n_x}\alpha_k K(x_k, x_i) - \sum_{l=1}^{n_y}\beta_l K(y_l, x_i) + b \ge 1 - \xi_i,\ \forall i;$$
$$\qquad \sum_{l=1}^{n_y}\beta_l K(y_l, y_j) - \sum_{k=1}^{n_x}\alpha_k K(x_k, y_j) - b \ge 1 - \eta_j,\ \forall j;$$
$$\qquad \alpha_k \ge 0,\ \beta_l \ge 0,\ \forall k, l; \quad \xi_i \ge 0,\ \eta_j \ge 0,\ \forall i, j.$$

Many positive semidefinite kernel functions can be used here; each corresponds to a different feature mapping function. In the experiments of this paper we used RBF kernels in particular. By solving the above linear programming problem, the optimal nonlinear hyperplane $(w, b)$ is obtained implicitly. A new sample $z$ can then be classified using the function

$$f(z) = w^\top\varphi(z) + b = \sum_{k=1}^{n_x}\alpha_k K(z, x_k) - \sum_{l=1}^{n_y}\beta_l K(z, y_l) + b \qquad (32)$$

As in SVMs, when $\alpha_k > 0$ or $\beta_l > 0$, the corresponding $x_k$ and $y_l$ are support vectors. The $\alpha$ and $\beta$ are usually sparse, so for prediction we only need to compute kernel values between the few support vectors and the test sample.
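As an illustration of the prediction step (not the authors' code), the decision function (32) with an RBF kernel can be evaluated as follows, assuming the nonnegative coefficients alpha, beta and the bias b have already been obtained by solving the linear program (31); all names are hypothetical.

    import numpy as np

    def rbf_kernel(A, B, g):
        # K(a, b) = exp(-g * ||a - b||^2) for all rows a of A and b of B.
        sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
        return np.exp(-g * sq)

    def mmrm_decision(Z, X, Y, alpha, beta, b, g):
        # Decision values f(z) of eq. (32) for the rows of Z; the sign gives the class.
        return rbf_kernel(Z, X, g) @ alpha - rbf_kernel(Z, Y, g) @ beta + b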

4. Experiments

In this section, we report experimental results on both binary and multi-class data sets. The MMRM is compared with four other min-max approaches: SVM, RMM, MPM and M4. SeDuMi (Sturm, 1999) is employed to train MMRM, M4 and MPM, and libsvm (Fan et al., 2005) is employed to train the SVM. We used the RMM code downloaded from the Internet (http://www.cs.columbia.edu/~pks2103/rmm/). The RBF kernels used in the experiments are defined as $K(x, z) = \exp(-g\,\|x - z\|^2)$.

4.1. Results on UCI Data Sets

We first conducted experiments on three data sets: Ionosphere, Pima and Sonar. Each data set was randomly partitioned into 90% training and 10% test sets. We tested both linear classification models and kernelized classification models with RBF kernels. The parameter g of the RBF kernel, the parameter B in RMM, and the trade-off parameter C in each comparison approach were tuned using cross validation. Classification accuracies and their standard deviations are reported in Table 1; the reported results are averages over 50 random partitions, for linear kernels and RBF kernels respectively. The proposed MMRM obtained the highest accuracies on two of the three data sets. Moreover, the results suggest that kernelization can improve the classification performance of these approaches on non-separable data sets.

Table 1: Comparison of classification accuracies (%) and standard deviations among MMRM, SVM, RMM, M4 and MPM.

    Linear kernel
    data         MMRM       SVM        RMM        M4         MPM
    Ionosphere   87.8(0.3)  86.7(0.3)  85.2(0.3)  87.7(0.4)  85.2(0.4)
    Pima         76.6(0.4)  77.9(0.4)  76.1(0.5)  77.7(0.4)  77.5(0.4)
    Sonar        76.0(1.2)  72.4(1.2)  74.3(1.2)  74.6(1.2)  71.6(1.2)

    RBF kernel
    data         MMRM       SVM        RMM        M4         MPM
    Ionosphere   94.4(0.3)  94.0(0.2)  94.3(0.3)  94.2(0.3)  92.3(0.3)
    Pima         76.9(0.4)  78.0(0.5)  76.2(0.5)  77.6(0.4)  76.2(0.5)
    Sonar        88.1(0.6)  86.5(0.7)  86.9(0.7)  87.3(0.6)  84.9(0.7)

Table 2: Comparison of classification errors of MMRM, RMM and SVM with RBF kernels.

                 Data info                              Errors
    data set     #train   #test   #features  #classes   MMRM  RMM  SVM
    Isolet       6238     1559    617        26         47    50   52
    Usps         7291     2007    256        10         89    96   98
    Letters      16000    4000    16         26         80    88   88
    Mnist        60000    10000   784        10         124   129  131

4.2. Results on Large Data Sets

Although our empirical results on regular UCI data sets are promising, we are more interested in the performance of the proposed MMRM on large scale data sets, where previous extensions of SVMs, such as MPM and M4, are computationally expensive. We therefore conducted experiments on four large scale data sets: Letters, Isolet, Usps and Mnist. Information about these data sets is given in Table 2. For Isolet, Usps and Letters we used the original data without further preprocessing. Principal component analysis (PCA) was used to reduce the dimensionality of Mnist from 784 down to 80 to speed up training. We compared the proposed MMRM approach with standard SVMs and RMMs on these four data sets; MPM and M4 were not tested due to their high computational complexity. Large data sets are often linearly non-separable, hence we used RBF kernels. Moreover, these four data sets have multiple classes, so we need to address multi-class classification. In this study, we used the one-against-one strategy and trained M(M-1)/2 binary classifiers for each M-class problem.
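A minimal sketch of the one-against-one scheme described above is given below; train_binary and decision stand in for any of the binary training and scoring routines discussed earlier and are hypothetical names, not code from the paper.

    from itertools import combinations

    def one_vs_one_train(data_by_class, train_binary):
        # Train M(M-1)/2 binary models, one per pair of class labels.
        return {(a, b): train_binary(data_by_class[a], data_by_class[b])
                for a, b in combinations(sorted(data_by_class), 2)}

    def one_vs_one_predict(models, decision, z):
        # Majority vote: decision(model, z) > 0 votes for the first class of the pair.
        votes = {}
        for (a, b), model in models.items():
            winner = a if decision(model, z) > 0 else b
            votes[winner] = votes.get(winner, 0) + 1
        return max(votes, key=votes.get)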

On each data set, we selected the parameters g and C for the SVM via 5-fold cross validation, and then used the same g and C for both RMM and MMRM. The additional parameter B in RMM was also selected by cross validation. Test errors of the three methods are reported in Table 2. The RMM outperformed the SVM in most cases. However, the price of this improvement is an extra parameter that is difficult to select and limits the generality of the model. The proposed MMRM obtained lower test errors than both SVMs and RMMs on all four data sets, without adding any additional parameters. In most cases the prediction errors of MMRM are about 10% lower than those of the SVM. Moreover, the MMRM is a linear programming problem while the SVM is a quadratic programming problem.

4.3. Margin Ratio vs. Prediction Accuracy

We also conducted experiments to investigate the relationship between the margin ratio (ρ) achieved by the MMRM and the prediction accuracy it produces, using the Usps data set. We trained 45 MMRM models, one for each pair of classes. For each pair we plot the margin ratio on the training set against the corresponding prediction accuracy on the test set in Figure 2. The plot suggests that the margin ratio of the model on the training data is positively correlated with the prediction accuracy on the test data. We used linear regression to model this relationship: prediction accuracy = 0.20354 ρ + 98.182, which indicates that the prediction accuracy increases roughly linearly with the margin ratio. The regression function is shown as a red line in Figure 2. This supports our assumption that maximizing the margin ratio can lead to a classification model with good generalization performance on test data.

[Figure 2: Relationship between prediction accuracy (%) and margin ratio (ρ) on the Usps data set.]
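The linear fit reported above can be reproduced from the 45 (margin ratio, accuracy) pairs with a one-line least-squares regression; the arrays are hypothetical placeholders for the values plotted in Figure 2.

    import numpy as np

    def fit_accuracy_vs_ratio(rho, acc):
        # Least-squares line through (margin ratio, test accuracy) pairs;
        # the paper reports approximately acc = 0.20354 * rho + 98.182.
        slope, intercept = np.polyfit(rho, acc, deg=1)
        return slope, intercept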

5. Conclusion

In this paper, a unified general framework for the existing min-max machine methods was presented in terms of within-class dispersions and between-class dispersions. A new within-class dispersion measure was then introduced, which leads to a novel max-margin ratio machine (MMRM) method. The MMRM can be formulated as a linear programming problem and scales to large data sets. Kernel techniques were used to derive a non-linear version of the MMRM to cope with non-linear classification problems. The empirical results show that the proposed MMRM approach can achieve higher classification accuracies on large data sets compared to standard SVMs and relative margin machines (RMMs).

References

R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

R. Fan, P. Chen, and C. Lin. Working set selection using second order information for training support vector machines. JMLR, 6:1889-1918, 2005.

K. Huang, H. Yang, I. King, and M. Lyu. Learning large margin machines locally and globally. In Proc. of ICML, 2004a.

K. Huang, H. Yang, I. King, M. Lyu, and L. Chan. Minimum error minimax probability machine. JMLR, 5:1253-1286, 2004b.

K. Huang, H. Yang, I. King, and M. Lyu. Maxi-min margin machine: Learning large margin classifiers locally and globally. IEEE Trans. on Neural Networks, 19(2):260-272, 2008.

G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. JMLR, 3:555-582, 2002.

P. Shivaswamy and T. Jebara. Relative margin machines. In Proc. of NIPS, 2008.

P. Shivaswamy and T. Jebara. Maximum relative margin and data-dependent regularization. JMLR, 11:747-788, 2010.

J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625-653, 1999.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

D. Yeung, D. Wang, W. Ng, E. Tsang, and X. Wang. Structured large margin machines: sensitive to data distributions. Machine Learning, 68:171-200, 2007.