SVM Tutorial: Classification, Regression, and Ranking

Hwanjo Yu and Sungchul Kim

1 Introduction

Support Vector Machines (SVMs) have been extensively researched in the data mining and machine learning communities for the last decade and actively applied to applications in various domains. SVMs are typically used for learning classification, regression, or ranking functions, for which they are called classifying SVM, support vector regression (SVR), or ranking SVM (RankSVM), respectively. Two special properties of SVMs are that they (1) achieve high generalization by maximizing the margin and (2) support efficient learning of nonlinear functions via the kernel trick. This chapter introduces these general concepts and techniques of SVMs for learning classification, regression, and ranking functions. In particular, we first present the SVMs for binary classification in Section 2, SVR in Section 3, ranking SVM in Section 4, and another recently developed method for learning ranking SVM, called Ranking Vector Machine (RVM), in Section 5.

2 SVM Classification

SVMs were initially developed for classification [5] and have been extended for regression [23] and preference (or rank) learning [14, 27]. The initial form of SVMs is a binary classifier where the output of the learned function is either positive or negative. A multiclass classification can be implemented by combining multiple binary classifiers using the pairwise coupling method [13, 15]. This section explains the motivation and formalization of SVM as a binary classifier, together with its two key properties: margin maximization and the kernel trick.

Hwanjo Yu, POSTECH, Pohang, South Korea, e-mail: hwanjoyu@postech.ac.kr
Sungchul Kim, POSTECH, Pohang, South Korea, e-mail: subright@postech.ac.kr

Fig. 1 Linear classifiers (hyperplanes) in a two-dimensional space

Binary SVMs are classifiers which discriminate data points of two categories. Each data object (or data point) is represented by an n-dimensional vector, and each of these data points belongs to only one of the two classes. A linear classifier separates them with a hyperplane. For example, Fig. 1 shows two groups of data and separating hyperplanes that are lines in a two-dimensional space. There are many linear classifiers that correctly classify (or divide) the two groups of data, such as L1, L2 and L3 in Fig. 1. In order to achieve maximum separation between the two classes, the SVM picks the hyperplane which has the largest margin. The margin is the summation of the shortest distances from the separating hyperplane to the nearest data point of each of the two categories. Such a hyperplane is likely to generalize better, meaning that it will correctly classify unseen or testing data points.

To support nonlinear classification problems, SVMs map the input space to a feature space. The kernel trick makes this possible without requiring an explicit formulation of the mapping function, which could otherwise suffer from the curse of dimensionality. A linear classification in the new space (the feature space) is then equivalent to a nonlinear classification in the original space (the input space). In short, SVMs map input vectors to a higher-dimensional space (the feature space) where a maximal separating hyperplane is constructed.
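
The following is a minimal sketch of these ideas, not part of the original chapter: it assumes scikit-learn and numpy are available, and the toy two-dimensional points and labels are made up. A very large C approximates the hard-margin classifier formalized in the next section; the learned weight vector, bias, and margin 1/||w|| can be read off the fitted model.

```python
# A minimal sketch (assumptions: scikit-learn available; toy data invented).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 3.5],
              [4.0, 1.0], [5.0, 0.5], [5.5, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]          # weight vector w
b = -clf.intercept_[0]    # bias, using the chapter's convention F(x) = w . x - b
margin = 1.0 / np.linalg.norm(w)

print("w =", w, " b =", b, " margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```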

2.1 Hard-margin SVM Classification

To understand how SVMs compute the hyperplane of maximal margin and support nonlinear classification, we first explain the hard-margin SVM, where the training data is free of noise and can be correctly classified by a linear function. The data points D in Fig. 1 (the training set) can be expressed mathematically as follows.

D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}   (1)

where x_i is an n-dimensional real vector and y_i is either 1 or -1, denoting the class to which the point x_i belongs. The SVM classification function F(x) takes the form

F(x) = w · x - b.   (2)

w is the weight vector and b is the bias, which will be computed by the SVM in the training process.

First, to correctly classify the training set, F(·) (or w and b) must return positive numbers for positive data points and negative numbers otherwise, that is, for every point x_i in D,

w · x_i - b > 0 if y_i = 1, and
w · x_i - b < 0 if y_i = -1.

These conditions can be combined into:

y_i (w · x_i - b) > 0, ∀(x_i, y_i) ∈ D   (3)

If there exists such a linear function F that correctly classifies every point in D, i.e., satisfies Eq. (3), then D is called linearly separable.

Second, F (or the hyperplane) needs to maximize the margin. The margin is the distance from the hyperplane to the closest data points. An example of such a hyperplane is illustrated in Fig. 2. To achieve this, Eq. (3) is revised into the following Eq. (4).

y_i (w · x_i - b) ≥ 1, ∀(x_i, y_i) ∈ D   (4)

Note that Eq. (4) includes the equality sign, and the right-hand side becomes 1 instead of 0. If D is linearly separable, i.e., every point in D satisfies Eq. (3), then there exists such an F that satisfies Eq. (4): if there exist w and b that satisfy Eq. (3), they can always be rescaled to satisfy Eq. (4).

The distance from the hyperplane to a vector x_i is |F(x_i)| / ||w||. Thus, the margin becomes

margin = 1 / ||w||   (5)

Fig. 2 SVM classification function: the hyperplane maximizing the margin in a two-dimensional space

This is because, when x_i is one of the closest vectors, F(x_i) returns 1 according to Eq. (4). The closest vectors, i.e., those that satisfy Eq. (4) with the equality sign, are called support vectors.

Maximizing the margin is equivalent to minimizing ||w||. Thus, the training problem of an SVM becomes the following constrained optimization problem.

minimize: Q(w) = (1/2) ||w||^2   (6)
subject to: y_i (w · x_i - b) ≥ 1, ∀(x_i, y_i) ∈ D   (7)

The factor of 1/2 is used for mathematical convenience.

2.1.1 Solving the Constrained Optimization Problem

The constrained optimization problem (6) and (7) is called the primal problem. It is characterized as follows:

- The objective function (6) is a convex function of w.
- The constraints are linear in w.

Accordingly, we may solve the constrained optimization problem using the method of Lagrange multipliers [3]. First, we construct the Lagrange function:

J(w, b, α) = (1/2) w · w - Σ_{i=1}^{m} α_i {y_i (w · x_i - b) - 1}   (8)

where the auxiliary nonnegative variables α_i are called Lagrange multipliers.

The solution to the constrained optimization problem is determined by the saddle point of the Lagrange function J(w, b, α), which has to be minimized with respect to w and b and maximized with respect to α. Thus, differentiating J(w, b, α) with respect to w and b and setting the results equal to zero, we get the following two conditions of optimality:

Condition 1: ∂J(w, b, α)/∂w = 0   (9)
Condition 2: ∂J(w, b, α)/∂b = 0   (10)

After rearrangement of terms, Condition 1 yields

w = Σ_{i=1}^{m} α_i y_i x_i   (11)

and Condition 2 yields

Σ_{i=1}^{m} α_i y_i = 0   (12)

The solution vector w is thus defined in terms of an expansion that involves the m training examples.

As noted earlier, the primal problem deals with a convex cost function and linear constraints. Given such a constrained optimization problem, it is possible to construct another problem called the dual problem. The dual problem has the same optimal value as the primal problem, but with the Lagrange multipliers providing the optimal solution. To postulate the dual problem for our primal problem, we first expand Eq. (8), term by term, as follows:

J(w, b, α) = (1/2) w · w - Σ_{i=1}^{m} α_i y_i w · x_i + b Σ_{i=1}^{m} α_i y_i + Σ_{i=1}^{m} α_i   (13)

The third term on the right-hand side of Eq. (13) is zero by virtue of the optimality condition of Eq. (12). Furthermore, from Eq. (11) we have

w · w = Σ_{i=1}^{m} α_i y_i w · x_i = Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_i · x_j   (14)

Accordingly, setting the objective function J(w, b, α) = Q(α), we can reformulate Eq. (13) as

Q(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_i · x_j   (15)

where the α_i are nonnegative. We now state the dual problem:

maximize: Q(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_i · x_j   (16)
subject to: Σ_{i=1}^{m} α_i y_i = 0   (17)
            α_i ≥ 0   (18)

Note that the dual problem is cast entirely in terms of the training data. Moreover, the function Q(α) to be maximized depends only on the input patterns in the form of the set of dot products {x_i · x_j}, i, j = 1, ..., m.

Having determined the optimum Lagrange multipliers, denoted by α_i*, we may compute the optimum weight vector w* using Eq. (11) and so write

w* = Σ_i α_i* y_i x_i   (19)

Note that, according to the Kuhn-Tucker conditions of optimization theory, the solution of the dual problem α_i* must satisfy the following condition:

α_i* {y_i (w* · x_i - b) - 1} = 0 for i = 1, 2, ..., m   (20)

that is, for each i, either α_i* or its corresponding constraint term {y_i (w* · x_i - b) - 1} must be zero. This condition implies that only when x_i is a support vector, i.e., y_i (w* · x_i - b) = 1, will its corresponding coefficient α_i* be nonzero (strictly positive, given Eq. (18)). In other words, the x_i whose corresponding coefficients α_i* are zero do not affect the optimum weight vector w* in Eq. (19). Thus, the optimum weight vector w* depends only on the support vectors, whose coefficients are positive. Once we have computed the nonzero α_i* and their corresponding support vectors, we can compute the bias b using any positive support vector x_i from the following equation.

b = w* · x_i - 1   (21)

The classification function of Eq. (2) now becomes

F(x) = Σ_i α_i y_i x_i · x - b   (22)
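
To make the primal/dual relationship concrete, here is a small sketch that is not part of the original chapter; it assumes cvxpy and numpy are available, and the four toy points are invented. It solves the primal problem (6)-(7) with a generic convex solver and reads the multipliers of the margin constraints, which should reproduce w through Eq. (19) and be nonzero only for the support vectors, as Eq. (20) predicts.

```python
# A minimal sketch (assumptions: cvxpy and numpy installed; toy data made up).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w - b) >= 1]                            # Eq. (7)
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)   # Eq. (6)
problem.solve()

alpha = constraints[0].dual_value            # Lagrange multipliers alpha_i
w_from_dual = (alpha * y) @ X                # Eq. (19): w = sum_i alpha_i y_i x_i

print("w (primal)    =", w.value)
print("w (from dual) =", w_from_dual)
print("alpha         =", alpha)              # nonzero only for support vectors (Eq. 20)
```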

2.2 Soft-margin SVM Classification

The discussion so far has focused on linearly separable cases. However, the optimization problem (6) and (7) has no solution if D is not linearly separable. To deal with such cases, the soft-margin SVM allows mislabeled data points while still maximizing the margin. The method introduces slack variables ξ_i, which measure the degree of misclassification. The following is the optimization problem for the soft-margin SVM.

minimize: Q_1(w, b, ξ) = (1/2) ||w||^2 + C Σ_i ξ_i   (23)
subject to: y_i (w · x_i - b) ≥ 1 - ξ_i, ∀(x_i, y_i) ∈ D   (24)
            ξ_i ≥ 0   (25)

Due to the ξ_i in Eq. (24), data points are allowed to be misclassified, and the amount of misclassification is minimized while maximizing the margin according to the objective function (23). C is a parameter that determines the trade-off between the margin size and the amount of error in training.

Similarly to the case of the hard-margin SVM, this primal form can be transformed to the following dual form using the Lagrange multipliers.

maximize: Q_2(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i · x_j   (26)
subject to: Σ_i α_i y_i = 0   (27)
            C ≥ α_i ≥ 0   (28)

Note that neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem. The dual problem for the case of nonseparable patterns is thus similar to that for the simple case of linearly separable patterns, except for a minor but important difference. The objective function Q(α) to be maximized is the same in both cases. The nonseparable case differs from the separable case in that the constraint α_i ≥ 0 is replaced with the more stringent constraint C ≥ α_i ≥ 0. Except for this modification, the constrained optimization for the nonseparable case and the computation of the optimum values of the weight vector w and bias b proceed in the same way as in the linearly separable case.

Just as in the hard-margin SVM, the α_i constitute a dual representation for the weight vector such that

w* = Σ_{i=1}^{m_s} α_i* y_i x_i   (29)

where m_s is the number of support vectors, i.e., those whose corresponding coefficient α_i* > 0. The determination of the optimum value of the bias also follows a procedure similar to that described before. Once α* and b* are computed, the function of Eq. (22) is used to classify new objects.

We can further disclose the relationships among α_i, ξ_i, and C through the Kuhn-Tucker conditions, which are defined by

α_i {y_i (w · x_i - b) - 1 + ξ_i} = 0, i = 1, 2, ..., m   (30)

and

μ_i ξ_i = 0, i = 1, 2, ..., m   (31)

Eq. (30) is a rewrite of Eq. (20) except for the replacement of the unity term 1 by (1 - ξ_i). As for Eq. (31), the μ_i are Lagrange multipliers that have been introduced to enforce the nonnegativity of the slack variables ξ_i for all i. At the saddle point, the derivative of the Lagrange function for the primal problem with respect to the slack variable ξ_i is zero, the evaluation of which yields

α_i + μ_i = C   (32)

By combining Eqs. (31) and (32), we see that

ξ_i = 0 if α_i < C, and   (33)
ξ_i ≥ 0 if α_i = C   (34)

We can graphically display the relationships among α_i, ξ_i, and C as in Fig. 3.

Fig. 3 Graphical relationships among α_i, ξ_i, and C

Data points outside the margin have α_i = 0 and ξ_i = 0, and those on the margin line have C > α_i > 0 and still ξ_i = 0. Data points within the margin have α_i = C. Among them, those correctly classified have 1 > ξ_i > 0, and misclassified points have ξ_i > 1.
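
The following is a minimal sketch, not from the chapter, of the soft-margin trade-off just described; it assumes scikit-learn is available and generates synthetic overlapping blobs. In scikit-learn, dual_coef_ stores y_i α_i for the support vectors, so α_i = C identifies the bounded support vectors of Eq. (34).

```python
# A minimal sketch (assumptions: scikit-learn available; data synthetic).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1                                   # relabel to {-1, +1}

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_[0])           # alpha_i of the support vectors
    bounded = np.sum(np.isclose(alpha, C))      # alpha_i = C -> inside the margin
    print(f"C={C:>6}: support vectors={len(alpha)}, bounded (alpha=C)={bounded}")
```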

2.3 Kernel Trick for Nonlinear Classification

If the training data is not linearly separable, there is no straight hyperplane that can separate the classes. In order to learn a nonlinear function in that case, linear SVMs must be extended to nonlinear SVMs for the classification of nonlinearly separable data. The process of finding classification functions using nonlinear SVMs consists of two steps. First, the input vectors are transformed into high-dimensional feature vectors where the training data can be linearly separated. Then, SVMs are used to find the hyperplane of maximal margin in the new feature space. The separating hyperplane is a linear function in the transformed feature space but a nonlinear function in the original input space.

Let x be a vector in the n-dimensional input space and ϕ(·) be a nonlinear mapping function from the input space to the high-dimensional feature space. The hyperplane representing the decision boundary in the feature space is defined as follows.

w · ϕ(x) - b = 0   (35)

where w denotes a weight vector that maps the training data in the high-dimensional feature space to the output space, and b is the bias. Using the ϕ(·) function, the weight becomes

w = Σ_i α_i y_i ϕ(x_i)   (36)

and the decision function of Eq. (22) becomes

F(x) = Σ_{i=1}^{m} α_i y_i ϕ(x_i) · ϕ(x) - b   (37)

Furthermore, the dual problem of the soft-margin SVM (Eq. (26)) can be rewritten using the mapping function on the data vectors as follows,

Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j ϕ(x_i) · ϕ(x_j)   (38)

holding the same constraints.

Note that the feature mapping functions in the optimization problem and also in the classifying function always appear as dot products, e.g., ϕ(x_i) · ϕ(x_j), the inner product between pairs of vectors in the transformed feature space. Computing this inner product in the transformed feature space directly seems complex and would suffer from the curse of dimensionality. To avoid this problem, the kernel trick is used. The kernel trick replaces the inner product in the feature space with a kernel function K evaluated in the original input space as follows.

K(u, v) = ϕ(u) · ϕ(v)   (39)

Mercer's theorem proves that a kernel function K is valid if and only if the following condition is satisfied for any function ψ(x) with finite ∫ ψ(x)^2 dx (refer to [9] for the detailed proof):

∫∫ K(u, v) ψ(u) ψ(v) du dv ≥ 0   (40)

Mercer's theorem ensures that the kernel function can always be expressed as the inner product between pairs of input vectors in some high-dimensional space. Thus, the inner product can be calculated using the kernel function with input vectors in the original space only, without transforming them into high-dimensional feature vectors.

The dual problem is now defined using the kernel function as follows.

maximize: Q_2(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)   (41)
subject to: Σ_i α_i y_i = 0   (42)
            C ≥ α_i ≥ 0   (43)

The classification function becomes:

F(x) = Σ_i α_i y_i K(x_i, x) - b   (44)

Since K(·, ·) is computed in the input space, no feature transformation is actually performed and no ϕ(·) is computed; thus the weight vector w = Σ_i α_i y_i ϕ(x_i) is not computed either in nonlinear SVMs.

The following are popularly used kernel functions.

- Polynomial: K(a, b) = (a · b + 1)^d
- Radial Basis Function (RBF): K(a, b) = exp(-γ ||a - b||^2)
- Sigmoid: K(a, b) = tanh(κ a · b + c)

Note that the kernel function is a kind of similarity function between two vectors, whose output is maximized when the two vectors are equivalent. Because of this, an SVM can learn a function from data of any shape beyond vectors (such as trees or graphs), as long as we can compute a similarity function between any pair of data objects. Further discussion of the properties of these kernel functions is out of scope. We will instead give an example of using the polynomial kernel for learning an XOR function in the following section.
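
As a small illustration, not from the chapter, the three kernels listed above can be written directly with numpy; the parameter values d, γ, κ, and c below are arbitrary choices, not values prescribed by the text.

```python
# A minimal sketch of the listed kernel functions (numpy assumed available).
import numpy as np

def polynomial_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, gamma=0.5):
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-gamma * np.dot(diff, diff))

def sigmoid_kernel(a, b, kappa=1.0, c=-1.0):
    return np.tanh(kappa * np.dot(a, b) + c)

u, v = np.array([1.0, -1.0]), np.array([1.0, 1.0])
for k in (polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(u, v))
```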

2.3.1 Example: XOR problem

To illustrate the procedure of training a nonlinear SVM function, assume we are given the training set of Table 1. Figure 4 plots the training points in the 2-D input space. There is no linear function that can separate the training points.

Table 1 XOR Problem
  Input vector x_i   Desired output y_i
  (-1, -1)           -1
  (-1, +1)           +1
  (+1, -1)           +1
  (+1, +1)           -1

Fig. 4 XOR Problem

To proceed, let

K(x, x_i) = (1 + x · x_i)^2   (45)

If we denote x = (x_1, x_2) and x_i = (x_{i1}, x_{i2}), the kernel function is expressed in terms of monomials of various orders as follows.

K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}   (46)

The image of the input vector x induced in the feature space is therefore deduced to be

ϕ(x) = (1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2)   (47)

Based on this mapping function, the objective function for the dual form can be derived from Eq. (41) as follows.

Q(α) = α_1 + α_2 + α_3 + α_4 - (1/2)(9α_1^2 - 2α_1 α_2 - 2α_1 α_3 + 2α_1 α_4 + 9α_2^2 + 2α_2 α_3 - 2α_2 α_4 + 9α_3^2 - 2α_3 α_4 + 9α_4^2)   (48)

Optimizing Q(α) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

9α_1 - α_2 - α_3 + α_4 = 1
-α_1 + 9α_2 + α_3 - α_4 = 1
-α_1 + α_2 + 9α_3 - α_4 = 1
α_1 - α_2 - α_3 + 9α_4 = 1

Hence, the optimal values of the Lagrange multipliers are

α_1 = α_2 = α_3 = α_4 = 1/8

This result denotes that all four input vectors are support vectors, and the optimum value of Q(α) is

Q(α) = 1/4

and

(1/2) ||w||^2 = 1/4, or ||w|| = 1/√2

From Eq. (36), we find that the optimum weight vector is

w = (1/8) [-ϕ(x_1) + ϕ(x_2) + ϕ(x_3) - ϕ(x_4)] = (0, 0, -1/√2, 0, 0, 0)   (49)

The bias b is 0 because the first element of w is 0. The optimal hyperplane becomes

w · ϕ(x) = (0, 0, -1/√2, 0, 0, 0) · (1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2) = 0   (50)

which reduces to

-x_1 x_2 = 0   (51)

-x_1 x_2 = 0 is the optimal hyperplane, the solution of the XOR problem. It makes the output y = -1 for both input points x_1 = x_2 = -1 and x_1 = x_2 = 1, and y = 1 for both input points x_1 = -1, x_2 = 1 and x_1 = 1, x_2 = -1. Figure 5 represents the four points in the transformed feature space.

Fig. 5 The 4 data points of the XOR problem in the transformed feature space
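
The worked example can be reproduced numerically with a small sketch that is not part of the chapter; it assumes scikit-learn is available. In scikit-learn's polynomial kernel (γ a · b + coef0)^degree, setting degree=2, γ=1, coef0=1 corresponds to the kernel K(x, x') = (1 + x · x')^2 of Eq. (45).

```python
# A minimal sketch (assumptions: scikit-learn available).
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, y)

print("predictions:", clf.predict(X))          # should reproduce y exactly
print("support vectors:", len(clf.support_))   # all four points are support vectors
print("alpha_i * y_i:", clf.dual_coef_)        # expected to be close to +/- 1/8
```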

3 SVM Regression

SVM Regression (SVR) is a method to estimate a function that maps from an input object to a real number, based on training data. Similarly to the classifying SVM, SVR has the same properties of margin maximization and the kernel trick for nonlinear mapping.

A training set for regression is represented as follows.

D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}   (52)

where x_i is an n-dimensional vector and y_i is the real number associated with each x_i. The SVR function F(x_i) maps an input vector x_i to the target y_i and takes the form

F(x) = w · x - b   (53)

where w is the weight vector and b is the bias. The goal is to estimate the parameters (w and b) of the function that give the best fit to the data.

An SVR function F(x) approximates all pairs (x_i, y_i) while keeping the differences between the estimated values and the real values within ε precision. That is, for every input vector x_i in D,

y_i - w · x_i - b ≤ ε   (54)
w · x_i + b - y_i ≤ ε   (55)

The margin is

margin = 1 / ||w||   (56)

By minimizing ||w||^2 to maximize the margin, the training in SVR becomes a constrained optimization problem as follows.

minimize: L(w) = (1/2) ||w||^2   (57)
subject to: y_i - w · x_i - b ≤ ε   (58)
            w · x_i + b - y_i ≤ ε   (59)

The solution of this problem does not allow any errors. To allow some errors in order to deal with noise in the training data, the soft-margin SVR uses slack variables ξ_i and ξ̂_i. Then, the optimization problem can be revised as follows.

minimize: L(w, ξ) = (1/2) ||w||^2 + C Σ_i (ξ_i + ξ̂_i), C > 0   (60)
subject to: y_i - w · x_i - b ≤ ε + ξ_i, ∀(x_i, y_i) ∈ D   (61)
            w · x_i + b - y_i ≤ ε + ξ̂_i, ∀(x_i, y_i) ∈ D   (62)
            ξ_i, ξ̂_i ≥ 0   (63)

The constant C > 0 is the trade-off parameter between the margin size and the amount of error. The slack variables ξ_i and ξ̂_i deal with otherwise infeasible constraints of the optimization problem by imposing a penalty on the excess deviations that are larger than ε.

To solve the optimization problem of Eq. (60), we can construct a Lagrange function from the objective function with Lagrange multipliers as follows:

minimize: L = (1/2) ||w||^2 + C Σ_i (ξ_i + ξ̂_i) - Σ_i (η_i ξ_i + η̂_i ξ̂_i)
              - Σ_i α_i (ε + ξ_i - y_i + w · x_i + b) - Σ_i α̂_i (ε + ξ̂_i + y_i - w · x_i - b)   (64)
subject to: η_i, η̂_i ≥ 0   (65)
            α_i, α̂_i ≥ 0   (66)

where η_i, η̂_i, α_i, α̂_i are the Lagrange multipliers, which are constrained to be nonnegative. The saddle point is found from the partial derivatives of L with respect to the primal variables (w, b, ξ̂_i), which must vanish:

∂L/∂b = Σ_i (α_i - α̂_i) = 0   (67)
∂L/∂w = w - Σ_i (α_i - α̂_i) x_i = 0,  i.e.,  w = Σ_i (α_i - α̂_i) x_i   (68)
∂L/∂ξ̂_i = C - α̂_i - η̂_i = 0,  i.e.,  η̂_i = C - α̂_i   (69)

The optimization problem with inequality constraints can be changed to the following dual optimization problem by substituting Eqs. (67), (68) and (69) into (64).

maximize: L(α) = Σ_i y_i (α_i - α̂_i) - ε Σ_i (α_i + α̂_i)   (70)
                 - (1/2) Σ_i Σ_j (α_i - α̂_i)(α_j - α̂_j) x_i · x_j   (71)
subject to: Σ_i (α_i - α̂_i) = 0   (72)
            0 ≤ α_i, α̂_i ≤ C   (73)

The dual variables η_i, η̂_i are eliminated in revising Eq. (64) into Eq. (70). Eqs. (68) and (69) can be rewritten as follows.

w = Σ_i (α_i - α̂_i) x_i   (74)
η_i = C - α_i   (75)
η̂_i = C - α̂_i   (76)

Thus w is represented by a linear combination of the training vectors x_i. Accordingly, the SVR function F(x) becomes the following function.

F(x) = Σ_i (α_i - α̂_i) x_i · x + b   (77)
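
Before turning to the kernelized case, here is a minimal sketch, not from the chapter, of ε-insensitive SVR in practice; it assumes scikit-learn is available and the 1-D data are synthetic. Swapping kernel="rbf" gives the nonlinear SVR discussed next.

```python
# A minimal sketch (assumptions: scikit-learn available; data synthetic).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = 1.5 * X.ravel() - 2.0 + 0.2 * rng.normal(size=40)   # noisy linear targets

# epsilon is the width of the insensitive tube (Eqs. 61-62); C is the trade-off of Eq. (60).
svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_))            # points on or outside the tube
print("w =", svr.coef_, " b =", svr.intercept_)
```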

Eq. (77) can map the training vectors to target real values while allowing some errors, but it cannot handle the nonlinear SVR case. The same kernel trick can be applied by replacing the inner product of two vectors x_i, x_j with a kernel function K(x_i, x_j). The transformed feature space is usually high-dimensional, and the SVR function in this space becomes nonlinear in the original input space. Using the kernel function K, the inner product in the transformed feature space can be computed as fast as the inner product x_i · x_j in the original input space. The same kernel functions introduced in Section 2.3 can be applied here.

Once the original inner product is replaced with a kernel function K, the remaining process for solving the optimization problem is very similar to that for the linear SVR. The optimization problem becomes the following, using the kernel function.

maximize: L(α) = Σ_i y_i (α_i - α̂_i) - ε Σ_i (α_i + α̂_i)
                 - (1/2) Σ_i Σ_j (α_i - α̂_i)(α_j - α̂_j) K(x_i, x_j)   (78)
subject to: Σ_i (α_i - α̂_i) = 0   (79)
            α_i, α̂_i ≥ 0   (80)
            0 ≤ α_i, α̂_i ≤ C   (81)

Finally, the SVR function F(x) becomes the following using the kernel function.

F(x) = Σ_i (α_i - α̂_i) K(x_i, x) + b   (82)

4 SVM Ranking

Ranking SVM, which learns a ranking (or preference) function, has produced various applications in information retrieval [14, 16, 28]. The task of learning ranking functions is distinguished from that of learning classification functions as follows:

1. While a training set in classification is a set of data objects and their class labels, in ranking, a training set is an ordering of data. Let "A is preferred to B" be specified as "A ≻ B". A training set for ranking SVM is denoted as R = {(x_1, y_1), ..., (x_m, y_m)}, where y_i is the rank of x_i, that is, y_i < y_j if x_i ≻ x_j.
2. Unlike a classification function, which outputs a distinct class for a data object, a ranking function outputs a score for each data object, from which a global ordering of the data is constructed. That is, the target function F(x_i) outputs a score such that F(x_i) > F(x_j) for any x_i ≻ x_j.

If not stated otherwise, R is assumed to be a strict ordering, which means that for all pairs x_i and x_j in a set D, either x_i ≻_R x_j or x_j ≻_R x_i. However, it can be straightforwardly generalized to weak orderings.

Let R* be the optimal ranking of the data, in which the data are ordered perfectly according to the user's preference. A ranking function F is typically evaluated by how closely its ordering R_F approximates R*.

Using the techniques of SVM, a global ranking function F can be learned from an ordering R. For now, assume F is a linear ranking function such that:

∀{(x_i, x_j) : y_i < y_j ∈ R} : F(x_i) > F(x_j) ⟺ w · x_i > w · x_j   (83)

A weight vector w is adjusted by a learning algorithm. We say an ordering R is linearly rankable if there exists a function F (represented by a weight vector w) that satisfies Eq. (83) for all {(x_i, x_j) : y_i < y_j ∈ R}.

The goal is to learn an F which is concordant with the ordering R and also generalizes well beyond R, that is, to find the weight vector w such that w · x_i > w · x_j for most data pairs {(x_i, x_j) : y_i < y_j ∈ R}. Though this problem is known to be NP-hard [10], a solution can be approximated using SVM techniques by introducing (non-negative) slack variables ξ_ij and minimizing the upper bound Σ ξ_ij as follows [14]:

minimize: L_1(w, ξ_ij) = (1/2) w · w + C Σ ξ_ij   (84)
subject to: ∀{(x_i, x_j) : y_i < y_j ∈ R} : w · x_i ≥ w · x_j + 1 - ξ_ij   (85)
            ∀(i, j) : ξ_ij ≥ 0   (86)

By the constraint (85) and by minimizing the upper bound Σ ξ_ij in (84), the above optimization problem satisfies the orderings on the training set R with minimal error. By minimizing w · w, or equivalently maximizing the margin (= 1/||w||), it tries to maximize the generalization of the ranking function. We will explain how maximizing the margin corresponds to increasing the generalization of ranking in Section 4.1. C is the soft-margin parameter that controls the trade-off between the margin size and the training error.

By rearranging the constraint (85) as

w · (x_i - x_j) ≥ 1 - ξ_ij   (87)

the optimization problem becomes equivalent to that of a classifying SVM on pairwise difference vectors (x_i - x_j). Thus, we can extend an existing SVM implementation to solve the problem.

Note that the support vectors are the data pairs (x_i^s, x_j^s) such that constraint (87) is satisfied with the equality sign, i.e., w · (x_i^s - x_j^s) = 1 - ξ_ij. Unbounded support vectors are the ones on the margin (i.e., their slack variables ξ_ij = 0), and bounded support vectors are the ones within the margin (i.e., 1 > ξ_ij > 0) or misranked (i.e., ξ_ij > 1). As in the classifying SVM, a function F in ranking SVM is expressed only by the support vectors.
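
The following is a minimal sketch, not from the chapter, of the pairwise transform implied by Eq. (87): a linear classifier is trained on difference vectors x_i - x_j. It assumes scikit-learn and numpy are available; the data, the hidden scoring vector, and the variable names are all invented for illustration.

```python
# A minimal sketch of ranking SVM via pairwise difference vectors (Eq. 87).
import itertools
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
ranks = np.argsort(np.argsort(-(X @ true_w)))   # smaller rank = preferred

diffs, labels = [], []
for i, j in itertools.combinations(range(len(X)), 2):
    diffs.append(X[i] - X[j])
    labels.append(1 if ranks[i] < ranks[j] else -1)   # +1 if x_i is preferred

clf = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(diffs), np.array(labels))
w = clf.coef_[0]                                # ranking weight vector of Eq. (83)
print("learned w (up to scale):", w / np.linalg.norm(w))
```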

Similarly to the classifying SVM, the primal problem of ranking SVM can be transformed to the following dual problem using the Lagrange multipliers.

maximize: L_2(α) = Σ_{ij} α_ij - (1/2) Σ_{ij} Σ_{uv} α_ij α_uv K(x_i - x_j, x_u - x_v)   (88)
subject to: C ≥ α_ij ≥ 0   (89)

Once transformed to the dual, the kernel trick can be applied to support nonlinear ranking functions. K(·, ·) is a kernel function, and α_ij is the coefficient for a pairwise difference vector (x_i - x_j). Note that the kernel function is computed P^2 (≈ m^4) times, where P is the number of data pairs and m is the number of data points in the training set; thus solving the ranking SVM takes at least O(m^4) time. Fast training algorithms for ranking SVM have been proposed [17], but they are limited to linear kernels.

Once the α_ij are computed, w can be written in terms of the pairwise difference vectors and their coefficients such that:

w = Σ_{ij} α_ij (x_i - x_j)   (90)

The ranking function F on a new vector z can be computed using the kernel function replacing the dot product as follows:

F(z) = w · z = Σ_{ij} α_ij (x_i - x_j) · z = Σ_{ij} α_ij K(x_i - x_j, z).   (91)

4.1 Margin Maximization in Ranking SVM

Fig. 6 Linear projection of four data points

We now explain the margin maximization of the ranking SVM, to reason about how the ranking SVM generates a ranking function of high generalization. We first establish some essential properties of ranking SVM.

For convenience of explanation, we assume the training set R is linearly rankable, and thus we use the hard-margin SVM, i.e., ξ_ij = 0 for all (i, j) in the objective (84) and the constraints (85).

In our ranking formulation, from Eq. (83), the linear ranking function F_w projects data vectors onto a weight vector w. For instance, Fig. 6 illustrates linear projections of four vectors {x_1, x_2, x_3, x_4} onto two different weight vectors, w_1 and w_2, in a two-dimensional space. Both F_w1 and F_w2 induce the same ordering R for the four vectors, that is, x_1 ≻_R x_2 ≻_R x_3 ≻_R x_4. The ranking difference of two vectors (x_i, x_j) according to a ranking function F_w is denoted by the geometric distance of the two vectors projected onto w, formulated as w · (x_i - x_j) / ||w||.

Corollary 1. Suppose F_w is a ranking function computed by the hard-margin ranking SVM on an ordering R. Then, the support vectors of F_w represent the data pairs that are closest to each other when projected onto w, and thus closest in ranking.

Proof. The support vectors are the data pairs (x_i^s, x_j^s) such that w · (x_i^s - x_j^s) = 1 in constraint (87), which is the smallest possible value among all data pairs (x_i, x_j) ∈ R. Thus, their ranking difference according to F_w (= w · (x_i^s - x_j^s) / ||w||) is also the smallest among them [24].

Corollary 2. The ranking function F, generated by the hard-margin ranking SVM, maximizes the minimal difference of any data pairs in ranking.

Proof. By minimizing w · w, the ranking SVM maximizes the margin δ = 1/||w|| = w · (x_i^s - x_j^s) / ||w||, where (x_i^s, x_j^s) are the support vectors, which denotes, from the proof of Corollary 1, the minimal difference of any data pairs in ranking.

The soft-margin SVM allows bounded support vectors, whose ξ_ij > 0, as well as unbounded support vectors, whose ξ_ij = 0, in order to deal with noise and allow a small error for an R that is not completely linearly rankable. However, the objective function in (84) also minimizes the amount of slack and thus the amount of error, and the support vectors are the close data pairs in ranking. Thus, maximizing the margin generates the effect of maximizing the differences of close data pairs in ranking.

From Corollaries 1 and 2, we observe that ranking SVM improves the generalization performance by maximizing the minimal ranking difference. For example, consider the two linear ranking functions F_w1 and F_w2 in Fig. 6. Although the two weight vectors w_1 and w_2 make the same ordering, intuitively w_1 generalizes better than w_2 because the distance between the closest vectors on w_1 (i.e., δ_1) is larger than that on w_2 (i.e., δ_2). SVM computes the weight vector w that maximizes the differences of close data pairs in ranking. Ranking SVMs find a ranking function of high generalization in this way.

5 Ranking Vector Machine: An Efficient Method for Learning the 1-norm Ranking SVM

This section presents another rank learning method, Ranking Vector Machine (RVM), a revised 1-norm ranking SVM that is better for feature selection and more scalable to large data sets than the standard ranking SVM. We first develop the 1-norm ranking SVM, a ranking SVM that is based on a 1-norm objective function. (The standard ranking SVM is based on a 2-norm objective function.) The 1-norm ranking SVM learns a function with many fewer support vectors than the standard SVM. Thereby, its testing time is much faster than that of 2-norm SVMs, and it provides better feature selection properties. (The function of a 1-norm SVM is likely to utilize fewer features by using fewer support vectors [11].) Feature selection is also important in ranking. Ranking functions are relevance or preference functions in document or data retrieval. Identifying key features increases the interpretability of the function. Feature selection for nonlinear kernels is especially challenging, and the fewer the support vectors, the more efficiently feature selection can be done [12, 20, 6, 30, 8].

We next present RVM, which revises the 1-norm ranking SVM for fast training. The RVM trains much faster than standard SVMs while not compromising the accuracy when the training set is relatively large. The key idea of RVM is to express the ranking function with ranking vectors instead of support vectors. Support vectors in ranking SVMs are pairwise difference vectors of the closest pairs, as discussed in Section 4. Thus, the training requires investigating every data pair as a potential candidate support vector, and the number of data pairs is quadratic in the size of the training set. On the other hand, the ranking function of the RVM utilizes each training data object instead of data pairs. Thus, the number of variables for optimization is substantially reduced in the RVM.

5.1 1-norm Ranking SVM

The goal of the 1-norm ranking SVM is the same as that of the standard ranking SVM, that is, to learn an F that satisfies Eq. (83) for most {(x_i, x_j) : y_i < y_j ∈ R} and generalizes well beyond the training set. In the 1-norm ranking SVM, we express Eq. (83) using the F of Eq. (91) as follows.

F(x_u) > F(x_v) ⟺ Σ_{ij}^{P} α_ij (x_i - x_j) · x_u > Σ_{ij}^{P} α_ij (x_i - x_j) · x_v   (92)
             ⟺ Σ_{ij}^{P} α_ij (x_i - x_j) · (x_u - x_v) > 0   (93)

Then, replacing the inner product with a kernel function, the 1-norm ranking SVM is formulated as:

minimize: L(α, ξ) = Σ_{ij}^{P} α_ij + C Σ_{uv}^{P} ξ_uv   (94)
s.t.: Σ_{ij}^{P} α_ij K(x_i - x_j, x_u - x_v) ≥ 1 - ξ_uv, ∀{(u, v) : y_u < y_v ∈ R}   (95)
      α ≥ 0, ξ ≥ 0   (96)

While the standard ranking SVM suppresses the weight w to improve the generalization performance, the 1-norm ranking SVM suppresses α in the objective function. Since the weight is expressed by the sum of the coefficients times the pairwise ranking difference vectors, suppressing the coefficients α corresponds to suppressing the weight w in the standard SVM. (Mangasarian proves this in [18].) C is a user parameter controlling the trade-off between the margin size and the amount of error ξ, and K is the kernel function. P is the number of pairwise difference vectors (≈ m^2).

The training of the 1-norm ranking SVM becomes a linear programming (LP) problem, thus solvable by LP algorithms such as the Simplex and interior point methods [18, 11, 19]. Just as in the standard ranking SVM, K needs to be computed P^2 (≈ m^4) times, and there are P constraints (95) and P coefficients α to compute. Once α is computed, F is computed using the same ranking function as the standard ranking SVM, i.e., Eq. (91).

The accuracies of the 1-norm ranking SVM and the standard ranking SVM are comparable, and both methods need to compute the kernel function O(m^4) times. In practice, the training of the standard SVM is more efficient because fast decomposition algorithms have been developed, such as sequential minimal optimization (SMO) [21], while the 1-norm ranking SVM uses common LP solvers. It has been shown that 1-norm SVMs use far fewer support vectors than standard 2-norm SVMs, that is, the number of positive coefficients (i.e., α > 0) after training is much smaller in the 1-norm SVMs than in the standard 2-norm SVMs [19, 11]. This is because, unlike in the standard 2-norm SVM, the support vectors in the 1-norm SVM are not bound to those close to the boundary in classification or to the minimal ranking difference vectors in ranking. Thus, the testing involves many fewer kernel evaluations, and it is more robust when the training set contains noisy features [31].
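
The following is a minimal sketch, not from the chapter, of the 1-norm ranking SVM LP of Eqs. (94)-(96) with a linear kernel, written with scipy.optimize.linprog; scipy and numpy are assumed, and the tiny ranked dataset and the value C = 1 are made up. The variables are the P coefficients α plus the P slacks ξ.

```python
# A minimal sketch of the LP (94)-(96) for a linear kernel.
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
ranks = np.argsort(np.argsort(-(X @ np.array([1.0, -1.0, 0.5]))))  # smaller = preferred

# Pairwise difference vectors d_uv = x_u - x_v for all pairs with y_u < y_v.
D = np.array([X[u] - X[v]
              for u, v in itertools.permutations(range(len(X)), 2)
              if ranks[u] < ranks[v]])
P = len(D)
G = D @ D.T                      # linear kernel: K(d_q, d_p) = d_q . d_p
C = 1.0

# Variables z = [alpha (P), xi (P)]; minimize sum(alpha) + C * sum(xi).
c = np.concatenate([np.ones(P), C * np.ones(P)])
A_ub = np.hstack([-G, -np.eye(P)])    # -G alpha - xi <= -1  <=>  G alpha + xi >= 1
b_ub = -np.ones(P)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

alpha = res.x[:P]
print("nonzero coefficients (support pairs):", int(np.sum(alpha > 1e-8)), "of", P)
```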

5.2 Ranking Vector Machine

Although the 1-norm ranking SVM has merits over the standard ranking SVM in terms of testing efficiency and feature selection, its training complexity is very high with respect to the number of data points. In this section, we present the Ranking Vector Machine (RVM), which revises the 1-norm ranking SVM to reduce the training time substantially. The RVM significantly reduces the number of variables in the optimization problem while not compromising the accuracy. The key idea of the RVM is to express the ranking function with ranking vectors instead of support vectors. The support vectors in ranking SVMs are chosen from pairwise difference vectors, and the number of pairwise difference vectors is quadratic in the size of the training set. On the other hand, the ranking vectors are chosen from the training vectors, and thus the number of variables to optimize is substantially reduced.

To theoretically justify this approach, we first present the Representer Theorem.

Theorem 1 (Representer Theorem [22]). Denote by Ω: [0, ∞) → R a strictly monotonically increasing function, by X a set, and by c : (X × R^2)^m → R ∪ {∞} an arbitrary loss function. Then each minimizer F ∈ H of the regularized risk

c((x_1, y_1, F(x_1)), ..., (x_m, y_m, F(x_m))) + Ω(||F||_H)   (97)

admits a representation of the form

F(x) = Σ_{i=1}^{m} α_i K(x_i, x)   (98)

The proof of the theorem is presented in [22]. Note that, in the theorem, the loss function c is arbitrary, allowing coupling between data points (x_i, y_i), and the regularizer Ω has to be monotonic.

Given such a loss function and regularizer, the representer theorem states that although we might be trying to solve the optimization problem in an infinite-dimensional space H, containing linear combinations of kernels centered on arbitrary points of X, the solution lies in the span of m particular kernels, those centered on the training points [22].

Based on the theorem, we define our ranking function F as Eq. (98), which is based on the training points rather than arbitrary points (or pairwise difference vectors). Function (98) is similar to function (91) except that, unlike the latter, which uses pairwise difference vectors (x_i - x_j) and their coefficients (α_ij), the former utilizes the training vectors (x_i) and their coefficients (α_i). With this function, Eq. (92) becomes the following.

F(x_u) > F(x_v) ⟺ Σ_i^m α_i K(x_i, x_u) > Σ_i^m α_i K(x_i, x_v)   (99)
             ⟺ Σ_i^m α_i (K(x_i, x_u) - K(x_i, x_v)) > 0.   (100)

Thus, we set our loss function c as follows.

c = Σ_{(u,v): y_u < y_v ∈ R} (1 - Σ_i^m α_i (K(x_i, x_u) - K(x_i, x_v)))   (101)

The loss function utilizes couples of data points, penalizing misranked pairs; that is, it returns higher values as the number of misranked pairs increases. Thus, the loss function is order sensitive, and it is an instance of the function class c in Eq. (97).

We set the regularizer Ω(||f||_H) = Σ_i^m α_i (α_i ≥ 0), which is strictly monotonically increasing. Let P be the number of pairs (u, v) ∈ R such that y_u < y_v, and let ξ_uv = 1 - Σ_i^m α_i (K(x_i, x_u) - K(x_i, x_v)). Then, our RVM is formulated as follows.

minimize: L(α, ξ) = Σ_i^m α_i + C Σ_{uv}^P ξ_uv   (102)
s.t.: Σ_i^m α_i (K(x_i, x_u) - K(x_i, x_v)) ≥ 1 - ξ_uv, ∀{(u, v) : y_u < y_v ∈ R}   (103)
      α, ξ ≥ 0   (104)

The solution of the optimization problem lies in the span of kernels centered on the training points (i.e., Eq. (98)), as suggested by the representer theorem. Just as the 1-norm ranking SVM, the RVM suppresses α to improve the generalization, and it enforces Eq. (100) by constraint (103). Note that there are only m coefficients α_i in the RVM. Thus, the kernel function is evaluated O(m^3) times, while the standard ranking SVM computes it O(m^4) times.

Another rationale for the RVM, i.e., for using training vectors instead of pairwise difference vectors in the ranking function, is that the support vectors in the 1-norm ranking SVM are not the closest pairwise difference vectors, so expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM. To explain this further, consider classifying SVMs. Unlike in the 2-norm (classifying) SVM, the support vectors in the 1-norm (classifying) SVM are not limited to those close to the decision boundary. This makes it possible for the 1-norm (classifying) SVM to express a similar boundary function with fewer support vectors. Directly extended from the 2-norm (classifying) SVM, the 2-norm ranking SVM improves the generalization by maximizing the closest pairwise ranking difference, which corresponds to the margin in the 2-norm (classifying) SVM, as discussed in Section 4. Thus, the 2-norm ranking SVM expresses the function with the closest pairwise difference vectors (i.e., the support vectors). However, the 1-norm ranking SVM improves the generalization by suppressing the coefficients α, just as the 1-norm (classifying) SVM does. Thus, the support vectors in the 1-norm ranking SVM are no longer the closest pairwise difference vectors, and expressing the ranking function with pairwise difference vectors is not as beneficial.
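
For contrast with the 1-norm ranking SVM sketch above, here is a minimal sketch, not from the chapter, of the RVM linear program of Eqs. (102)-(104) with a linear kernel; scipy and numpy are assumed, and the toy data and C = 1 are invented. Only m coefficients α_i (one per training vector) are optimized, instead of P coefficients over pairs.

```python
# A minimal sketch of the RVM LP (102)-(104) with K(a, b) = a . b.
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
ranks = np.argsort(np.argsort(-(X @ np.array([1.0, -1.0, 0.5]))))

K = X @ X.T                                   # linear kernel matrix K(x_i, x_j)
pairs = [(u, v) for u, v in itertools.permutations(range(len(X)), 2)
         if ranks[u] < ranks[v]]
M = np.array([K[:, u] - K[:, v] for u, v in pairs])   # M[p, i] = K(x_i,x_u) - K(x_i,x_v)
m, P, C = len(X), len(pairs), 1.0

c = np.concatenate([np.ones(m), C * np.ones(P)])       # Eq. (102)
A_ub = np.hstack([-M, -np.eye(P)])                     # Eq. (103): M alpha + xi >= 1
b_ub = -np.ones(P)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

alpha = res.x[:m]
print("ranking vectors (alpha_i > 0):", int(np.sum(alpha > 1e-8)), "of", m)
```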

5.3 Experiment

This section evaluates the RVM on synthetic datasets (Section 5.3.1) and a real-world dataset (Section 5.3.2). The RVM is compared with the state-of-the-art ranking SVM provided in SVM-light. Experiment results show that the RVM trains substantially faster than SVM-light for nonlinear kernels while their accuracies are comparable. More importantly, the number of ranking vectors in the RVM is multiple orders of magnitude smaller than the number of support vectors in SVM-light. Experiments are performed on a Windows XP Professional machine with a Pentium IV 2.8 GHz and 1 GB of RAM. We implemented the RVM using C and used CPLEX (http://www.ilog.com/products/cplex/) as the LP solver. The source codes are freely available at http://s.postech.ac.kr/rvm [29].

Evaluation metric: MAP (mean average precision) is used to measure the ranking quality when there are only two classes of ranking [26], and NDCG is used to evaluate ranking performance for IR applications when there are multiple levels of ranking [2, 4, 7, 25]. Kendall's τ is used when there is a global ordering of the data and the training data is a subset of it. Ranking SVMs as well as the RVM minimize the amount of error or mis-ranking, which corresponds to optimizing Kendall's τ [16, 27]. Thus, we use Kendall's τ to compare their accuracy.

Kendall's τ computes the overall accuracy by comparing the similarity of two orderings R and R_F. (R_F is the ordering of D according to the learned function F.) Kendall's τ is defined based on the number of concordant pairs and discordant pairs. If R and R_F agree in how they order a pair, x_i and x_j, the pair is concordant; otherwise, it is discordant. The accuracy of a function F is defined as the number of concordant pairs between R and R_F divided by the total number of pairs in D, as follows.

F(R, R_F) = (# of concordant pairs) / C(|R|, 2)

For example, suppose R and R_F order five points x_1, ..., x_5 as follows:

R:   (x_1, x_2, x_3, x_4, x_5)
R_F: (x_3, x_2, x_1, x_4, x_5)

Then, the accuracy of F is 0.7, as the number of discordant pairs is 3, i.e., {x_1, x_2}, {x_1, x_3}, {x_2, x_3}, while all remaining 7 pairs are concordant.
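
The worked example can be checked with a small sketch that is not part of the chapter; it assumes scipy is available. When there are no ties, the accuracy defined above equals (τ + 1)/2.

```python
# A minimal sketch checking the Kendall's tau example above.
from scipy.stats import kendalltau

# Positions of x_1..x_5 in each ordering (1 = ranked first).
pos_R  = [1, 2, 3, 4, 5]        # R   = (x1, x2, x3, x4, x5)
pos_RF = [3, 2, 1, 4, 5]        # R_F = (x3, x2, x1, x4, x5)

tau, _ = kendalltau(pos_R, pos_RF)
accuracy = (tau + 1) / 2          # concordant / total pairs, as defined above
print("Kendall tau =", tau)       # (7 - 3) / 10 = 0.4
print("accuracy    =", accuracy)  # 0.7, matching the example
```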

Fig. 7 Accuracy: Kendall's τ versus size of training set, (a) Linear, (b) RBF

Fig. 8 Training time in seconds versus size of training set, (a) Linear kernel, (b) RBF kernel

Fig. 9 Number of support (or ranking) vectors versus size of training set, (a) Linear kernel, (b) RBF kernel

5.3.1 Experiments on Synthetic Datasets

Below is the description of our experiments on synthetic datasets.

1. We randomly generated a training and a testing dataset, D_train and D_test respectively, where D_train contains m_train (= 40, 80, 120, 160, 200) data points of n (e.g., 5) dimensions (i.e., an m_train-by-n matrix), and D_test contains m_test (= 50) data points of n dimensions (i.e., an m_test-by-n matrix). Each element in the matrices is a random number between zero and one. (We only did experiments on data sets of up to 200 objects for performance reasons; ranking SVMs run intolerably slowly on data sets larger than 200.)
2. We randomly generate a global ranking function F*, by randomly generating the weight vector w* in F*(x) = w* · x for the linear case, and in F*(x) = exp(-||w* - x||^2) for the RBF case.

3. We rank D_train and D_test according to F*, which forms the global orderings R_train and R_test on the training and testing data.
4. We train a function F from R_train, and test the accuracy of F on R_test.

We tuned the soft-margin parameter C by trying C = 10^-5, 10^-4, ..., 10^5 and used the highest accuracy for comparison. For the linear and RBF functions, we used linear and RBF kernels accordingly. We repeat this entire process 30 times to get the mean accuracy.

Accuracy: Figure 7 compares the accuracies of the RVM and the ranking SVM from SVM-light. The ranking SVM outperforms the RVM when the size of the data set is small, but their difference becomes trivial as the size of the data set increases. This phenomenon can be explained by the fact that when the training size is too small, the number of potential ranking vectors becomes too small to draw an accurate ranking function, whereas the number of potential support vectors is still large. However, as the size of the training set increases, the RVM becomes as accurate as the ranking SVM because the number of potential ranking vectors becomes large as well.

Training Time: Figure 8 compares the training time of the RVM and SVM-light. While SVM-light trains much faster than the RVM for the linear kernel (SVM-light is specially optimized for the linear kernel), the RVM trains significantly faster than SVM-light for the RBF kernel.

Number of Support (or Ranking) Vectors: Figure 9 compares the number of support (or ranking) vectors used in the functions of the RVM and SVM-light. The RVM's model uses a significantly smaller number of support vectors than SVM-light.

Fig. 10 Sensitivity to noise (m_train = 100): decrement in accuracy versus k, (a) Linear, (b) RBF

Sensitivity to noise: In this experiment, we compare the sensitivity of each method to noise. We insert noise by switching the orders of some data pairs in R_train. We set the size of the training set m_train = 100 and the dimension n = 5. After we make R_train from a random function F*, we randomly pick k vectors from R_train and switch each with its adjacent vector in the ordering to implant noise in the training set. Figure 10 shows the decrement of the accuracies as the number of misorderings increases in the training set.

Their accuracies decrease moderately as the noise increases in the training set, and their sensitivities to noise are comparable.

5.3.2 Experiment on a Real Dataset

In this section, we experiment using the OHSUMED dataset obtained from LETOR, the site containing benchmark datasets for ranking [1]. OHSUMED is a collection of documents and queries on medicine, consisting of 348,566 references and 106 queries. There are in total 16,140 query-document pairs upon which relevance judgements are made. In this dataset the relevance judgements have three levels: definitely relevant, partially relevant, and irrelevant. The OHSUMED dataset in LETOR extracts 25 features. We report our experiments on the first three queries and their documents. We compare the performance of the RVM and SVM-light on them. We tuned the parameters by 3-fold cross validation, trying C and γ = 10^-6, 10^-5, ..., 10^6 for the linear and RBF kernels, and compared the highest performance. The training time is measured for training the model with the tuned parameters. We repeated the whole process three times and report the mean values.

Table 2 Experiment results: Accuracy (Acc), Training Time (Time), and Number of Support or Ranking Vectors (#SV or #RV)

                query 1 (|D| = 134)       query 2 (|D| = 128)       query 3 (|D| = 182)
                Acc    Time   #SV/#RV     Acc    Time   #SV/#RV     Acc    Time   #SV/#RV
RVM  linear     .5484  .23    1.4         .6730  .41    3.83        .6611  1.94   1.99
RVM  RBF        .5055  .85    4.3         .6637  .41    2.83        .6723  4.71   1
SVM  linear     .5634  1.83   92          .6723  1.03   101.66      .6588  4.24   156.55
SVM  RBF        .5490  3.05   92          .6762  3.50   102         .6710  55.08  156.66

Table 2 shows the results. The accuracies of the SVM and RVM are comparable overall; the SVM shows slightly higher accuracy than the RVM for query 1, but for the other queries their accuracy differences are not statistically significant. More importantly, the number of ranking vectors in the RVM is significantly smaller than the number of support vectors in the SVM. For example, for query 3, the RVM, having just one ranking vector, outperformed the SVM with over 150 support vectors. The training time of the RVM is significantly shorter than that of SVM-light.

References

1. LETOR: Learning to rank for information retrieval. http://research.microsoft.com/users/LETOR/
2. Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval. ACM Press (1999)
3. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1995)

4. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proc. Int. Conf. Machine Learning (ICML'04) (2004)
5. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121-167 (1998)
6. Cao, B., Shen, D., Sun, J.T., Yang, Q., Chen, Z.: Feature selection in a kernel space. In: Proc. Int. Conf. Machine Learning (ICML'07) (2007)
7. Cao, Y., Xu, J., Liu, T.Y., Li, H., Huang, Y., Hon, H.W.: Adapting ranking SVM to document retrieval. In: Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'06) (2006)
8. Cho, B., Yu, H., Lee, J., Chee, Y., Kim, I.: Nonlinear support vector machine visualization for risk factor analysis using nomograms and localized radial basis function kernels. IEEE Transactions on Information Technology in Biomedicine (accepted)
9. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press (2000)
10. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. In: Proc. Advances in Neural Information Processing Systems (NIPS'98) (1998)
11. Fung, G., Mangasarian, O.L.: A feature selection Newton method for support vector machine classification. Computational Optimization and Applications (2004)
12. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research (2003)
13. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: Advances in Neural Information Processing Systems (1998)
14. Herbrich, R., Graepel, T., Obermayer, K. (eds.): Large margin rank boundaries for ordinal regression. MIT Press (2000)
15. Friedman, J.H.: Another approach to polychotomous classification. Tech. rep., Stanford University, Department of Statistics, 10:1895-1924 (1998)
16. Joachims, T.: Optimizing search engines using clickthrough data. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02) (2002)
17. Joachims, T.: Training linear SVMs in linear time. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'06) (2006)
18. Mangasarian, O.L.: Generalized Support Vector Machines. MIT Press (2000)
19. Mangasarian, O.L.: Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research (2006)
20. Mangasarian, O.L., Wild, E.W.: Feature selection for nonlinear kernel support vector machines. Tech. rep., University of Wisconsin, Madison (1998)
21. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: B. Scholkopf, C. Burges, A. Smola (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA (1998)
22. Scholkopf, B., Herbrich, R., Smola, A.J., Williamson, R.C.: A generalized representer theorem. In: Proc. COLT (2001)
23. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. Tech. rep., NeuroCOLT2 Technical Report NC2-TR-1998-030 (1998)
24. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons (1998)
25. Xu, J., Li, H.: AdaRank: A boosting algorithm for information retrieval. In: Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'07) (2007)
26. Yan, L., Dodier, R., Mozer, M.C., Wolniewicz, R.: Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistics. In: Proc. Int. Conf. Machine Learning (ICML'03) (2003)
27. Yu, H.: SVM selective sampling for ranking with application to data retrieval. In: Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD'05) (2005)
28. Yu, H., Hwang, S.W., Chang, K.C.C.: Enabling soft queries for data retrieval. Information Systems (2007)
29. Yu, H., Kim, Y., Hwang, S.W.: RVM: An efficient method for learning ranking SVM. Tech. rep., Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea, http://s.hwanjoyu.org/rvm (2008)

30. Yu, H., Yang, J., Wang, W., Han, J.: Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. In: IEEE Computer Society Bioinformatics Conf. (CSB'03), pp. 220-228 (2003)
31. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. In: Proc. Advances in Neural Information Processing Systems (NIPS'00) (2003)
