Support vector comparison machines
THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE

Toby HOCKING, Supaporn SPANURATTANA, and Masashi SUGIYAMA
Department of Computer Science, Tokyo Institute of Technology, O-okayama, Meguro-ku, Tokyo, Japan

Abstract: In ranking problems, the goal is to learn a ranking function r(x) ∈ R from labeled pairs x, x' of input points. In this paper, we consider the related comparison problem, where the label y ∈ {-1, 0, 1} indicates which element of the pair is better, or if there is no significant difference. We cast the learning problem as a margin maximization, and show that it can be solved by converting it to a standard SVM. We compare our algorithm to SVMrank using simulated data.

Key words: support vector machine, quadratic program, linear program, ranking, margin, comparison

1 Introduction

In this paper we consider the supervised comparison problem. Assume that we have n labeled training pairs, and for each pair i ∈ {1, ..., n} we have input features x_i, x_i' ∈ R^p and a label y_i ∈ {-1, 0, 1} that indicates which element is better:

  y_i = -1 if x_i is better than x_i',
  y_i = 0 if x_i is as good as x_i',
  y_i = 1 if x_i' is better than x_i.   (1)

These data are geometrically represented by segments and arrows in the top panel of Figure 1. Comparison data naturally arise when considering subjective human evaluations of pairs of items. For example, if I were to compare some pairs of movies I have watched, I would say Les Misérables is better than Star Wars, and The Empire Strikes Back is as good as Star Wars. Features x_i, x_i' of the movies can be length in minutes, year of theatrical release, indicators for genre, actors/actresses, directors, etc. The goal of learning is to find a comparison function c : R^p × R^p → {-1, 0, 1} which generalizes to a test set of data, as measured by the zero-one loss

  Σ_{i ∈ test} I[c(x_i, x_i') ≠ y_i],   (2)

where I is the indicator function. The rest of this article is organized as follows. In Section 2 we discuss links with related work on classification and ranking, then in Section 3 we propose a new algorithm: SVMcompare. We show results on simulated data in Section 4 and discuss future work in Section 5.

2 Related work

First we discuss connections with several existing methods in the machine learning literature, and then we discuss how ranking algorithms can be applied to the comparison problem.

2.1 Rank, reject, and rate

Comparison is similar to ranking and classification with a reject option (Table 1). The classification with reject option problem is a version of binary classification with outputs y ∈ {-1, 0, 1}, where 0 signifies rejection, or no guess [1]. There are many algorithms for the supervised learning-to-rank problem [2], which is similar to the supervised comparison problem we consider in this paper. The main idea of learning to rank is to train on labeled pairs of possible documents x_i, x_i' for the same search query. The labels are y_i ∈ {-1, 1}, where y_i = -1 means document x_i is more relevant than x_i' and y_i = 1 means the opposite. The SVMrank algorithm was proposed for this problem [3], and the algorithm we propose here is similar. The difference is that we also consider the case where both items/documents are judged to be equally good (y_i = 0). A boosting algorithm has been proposed for this ranking-with-ties problem [4], and the authors observed that modeling ties is more effective when there are more output values.

Table 1: Comparison is similar to ranking and classification with a reject option.

  Outputs \ Inputs    single items x    pairs x, x'
  y ∈ {-1, 1}         SVM               SVMrank
  y ∈ {-1, 0, 1}      reject option     this work
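To make this setup concrete, here is a minimal sketch in R (the language of our implementation); the feature values and labels below are hypothetical, and the baseline comparison function is a stand-in for a learned one. It stores n labeled pairs and evaluates the zero-one loss (2):

  ## Hypothetical comparison data: n pairs of p-dimensional feature
  ## vectors with labels in {-1, 0, 1}.
  set.seed(1)
  n <- 6; p <- 2
  X  <- matrix(runif(n * p, -3, 3), n, p)  # first item x_i of each pair
  Xp <- matrix(runif(n * p, -3, 3), n, p)  # second item x_i'
  y  <- c(-1, 0, 1, 1, 0, -1)  # -1: x better, 0: no difference, 1: x' better
  ## Zero-one loss (2) of a comparison function cmp(x, x') on labeled pairs.
  zero.one.loss <- function(cmp, X, Xp, y) {
    pred <- sapply(seq_along(y), function(i) cmp(X[i, ], Xp[i, ]))
    sum(pred != y)
  }
  ## A trivial baseline that always predicts "no significant difference"
  ## pays one unit of loss for every inequality pair.
  zero.one.loss(function(x, xp) 0, X, Xp, y)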
A Bayesian model which can be applied to these data is TrueSkill [5], a generalization of the Elo chess rating system. When the inputs are discrete, x_i, x_i' ∈ {1, ..., k}, the problem is known as learning a relation [6]. In this article we consider the case where the inputs are continuous, x_i, x_i' ∈ R^p.

Figure 1: Geometric interpretation. Top: input feature pairs x_i, x_i' ∈ R^p are shown as segments and arrows in the original feature space, colored using the labels y_i ∈ {-1, 0, 1}. The level curves of the ranking function r(x) are grey, and differences |r(x') - r(x)| ≤ 1 are considered insignificant (y_i = 0). Middle: in the enlarged feature space Φ(x), the ranking function is linear: r(x) = w^T Φ(x). Bottom: in the space of differences of enlarged features Φ(x') - Φ(x), two symmetric hyperplanes w^T[Φ(x_i') - Φ(x_i)] ∈ {-1, 1} are used to classify the difference vectors.

2.2 SVMrank for comparing

In this section we explain how to apply the existing SVMrank algorithm to a comparison data set. The goal of the SVMrank algorithm is to learn a ranking function r : R^p → R. When r(x) = w^T x is linear, the primal problem for some cost parameter C ∈ R_+ is the following quadratic program (QP):

  min_{w, ξ} (1/2) w^T w + C Σ_{i ∈ I_1 ∪ I_{-1}} ξ_i
  subject to ξ_i ≥ 0 and ξ_i ≥ 1 - w^T (x_i' - x_i) y_i, for all i ∈ I_1 ∪ I_{-1},   (3)

where I_y = {i | y_i = y} are the sets of indices for the different labels. Note that the equality pairs i ∈ I_0 are not used in the optimization problem (3). After obtaining a weight vector w ∈ R^p and ranking function r by solving (3), we define a threshold τ ∈ R_+ and a thresholding function t_τ : R → {-1, 0, 1},

  t_τ(z) = -1 if z < -τ,  0 if -τ ≤ z ≤ τ,  1 if z > τ.   (4)

A comparison function c_τ : R^p × R^p → {-1, 0, 1} is defined as the thresholded difference of predicted ranks:

  c_τ(x, x') = t_τ(r(x') - r(x)).   (5)

We can then use grid search to estimate an optimal threshold τ̂, by minimizing the zero-one loss with respect to all the training pairs:

  τ̂ = argmin_τ Σ_{i=1}^n I[c_τ(x_i, x_i') ≠ y_i].   (6)

However, there are two potential problems with the learned comparison function c_τ̂. First, the equality pairs i ∈ I_0 are not used to learn the weight vector w in (3). Second, the threshold τ̂ is learned in a separate optimization step, which may be suboptimal. In the next section, we propose a new algorithm that fixes these potential problems.
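The SVMrank post-processing steps (4)-(6) are straightforward to sketch in R, reusing X, Xp, y, and zero.one.loss from the earlier sketch; the linear ranking function r below is a hypothetical stand-in for one obtained by solving (3):

  ## Thresholding function t_tau (4): map a rank difference to {-1, 0, 1}.
  t.tau <- function(z, tau) ifelse(z < -tau, -1, ifelse(z > tau, 1, 0))
  ## Comparison function c_tau (5): thresholded difference of ranks.
  c.tau <- function(r, tau) function(x, xp) t.tau(r(xp) - r(x), tau)
  ## Grid search (6) for the threshold minimizing zero-one training loss.
  r <- function(x) sum(c(1, 2) * x)  # stand-in for a learned w'x
  taus <- seq(0, 3, by = 0.01)
  errors <- sapply(taus, function(tau) zero.one.loss(c.tau(r, tau), X, Xp, y))
  tau.hat <- taus[which.min(errors)]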
3 Support vector comparison machines

In this section we discuss new learning algorithms for comparison problems. In all cases, we first learn a ranking function r : R^p → R and then a comparison function c : R^p × R^p → {-1, 0, 1}:

  c(x, x') = t_1(r(x') - r(x)),   (7)

using the threshold function t_1 of (4). In other words, any small difference in ranks |r(x') - r(x)| ≤ 1 is considered insignificant, and there are two decision boundaries, r(x') - r(x) ∈ {-1, 1}.

3.1 LP and QP for separable data

To illustrate the nature of the max-margin comparison problem, in this section we assume that the training data are linearly separable. Later, in Section 3.2, we propose an algorithm for learning a nonlinear function from non-separable data.

Using the following linear program (LP), we learn a linear ranking function r(x) = w^T x that maximizes the geometric margin µ. As shown in the left panel of Figure 2, the geometric margin µ is the smallest distance from any difference vector x_i' - x_i to a decision boundary r(x) ∈ {-1, 1}. The max margin LP is

  max_{µ ∈ R_+, w ∈ R^p} µ
  subject to µ ≤ 1 - |w^T (x_i' - x_i)| for all i ∈ I_0,
             µ ≤ -1 + w^T (x_i' - x_i) y_i for all i ∈ I_1 ∪ I_{-1}.   (8)

Note that finding a feasible point for this LP is a test of linear separability: if there are no feasible points, then the data are not linearly separable.

Another way to formulate the comparison problem is by first performing a change of variables, and then solving a binary SVM QP. The idea is to maximize the margin between significant differences y_i ∈ {-1, 1} and equality pairs y_i = 0. Let X_y, X_y' be the |I_y| × p matrices formed by all the pairs i ∈ I_y. We define a new flipped data set with m = |I_1| + |I_{-1}| + 2|I_0| pairs suitable for training a binary SVM:

  X̃ = [X_0; X_0'; X_1; X_{-1}'],  X̃' = [X_0'; X_0; X_1'; X_{-1}],  ỹ = [-1_{|I_0|}; -1_{|I_0|}; 1_{|I_1|}; 1_{|I_{-1}|}],   (9)

where semicolons denote row-stacking, 1_n is an n-vector of ones, X̃, X̃' ∈ R^{m×p}, and ỹ ∈ {-1, 1}^m. Note that ỹ_i = -1 implies no significant difference between x̃_i and x̃_i', and ỹ_i = 1 implies that x̃_i' is better than x̃_i. We then learn an affine function f(x) = β + u^T x using a binary SVM QP (black lines in the middle and right panels of Figure 2):

  min_{u ∈ R^p, β ∈ R} (1/2) u^T u
  subject to ỹ_i (β + u^T (x̃_i' - x̃_i)) ≥ 1 for all i ∈ {1, ..., m}.   (10)

Intuitively, the SVM QP learns a separator f(x) = 0 between significant difference pairs (ỹ_i = 1) and insignificant difference pairs (ỹ_i = -1). However, we want a comparison function that predicts c(x, x') ∈ {-1, 0, 1}. So we use the next lemma to construct a ranking function r(x) = ŵ^T x that is feasible for the original max margin comparison LP (8) and can be used with the comparison function (7).

Figure 2: The separable LP and QP comparison problems. Left: the difference vectors x' - x of the original data and the optimal solution to the LP (8). Middle: for the unscaled flipped data x̃' - x̃ (9), the LP is not the same as the QP (13). Right: in these scaled data, the QP is equivalent to the LP.
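Before stating the lemma, note that the flipped data set (9) amounts to a few lines of row-stacking; a sketch in R, reusing X, Xp, y from the earlier sketches:

  ## Flipped data set (9): each equality pair enters twice (once per
  ## direction) with label -1; each inequality pair is oriented so that
  ## the second item is the better one, with label 1.
  I0 <- which(y == 0); I1 <- which(y == 1); Im1 <- which(y == -1)
  Xtil  <- rbind(X[I0, , drop = FALSE], Xp[I0, , drop = FALSE],
                 X[I1, , drop = FALSE], Xp[Im1, , drop = FALSE])
  Xtilp <- rbind(Xp[I0, , drop = FALSE], X[I0, , drop = FALSE],
                 Xp[I1, , drop = FALSE], X[Im1, , drop = FALSE])
  ytil <- c(rep(-1, 2 * length(I0)), rep(1, length(I1) + length(Im1)))
  m <- length(ytil)  # m = |I_1| + |I_-1| + 2|I_0|
  ## The hard-margin QP (10) is then a standard binary SVM on the
  ## difference vectors Xtilp - Xtil with labels ytil.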
Lemma 1: Let u ∈ R^p, β ∈ R be a solution of (10). Then µ̂ = 1/β and ŵ = u/β are feasible for (8).

Proof: Begin by assuming that we want to find a ranking function r(x) = ŵ^T x = γ u^T x, where γ ∈ R is a scaling constant. Then consider that for all x on the decision boundary, we have

  r(x) = ŵ^T x = -1 and f(x) = u^T x + β = 0.   (11)

Taken together, it is clear that γ = 1/β and thus ŵ = u/β. Consider that for all x on the margin we have

  r(x) = ŵ^T x = -1 + µ̂ and f(x) = u^T x + β = 1.   (12)

Taken together, these imply µ̂ = 1/β. Now, by definition of the flipped data (9), we can re-write the max margin QP (10) as

  min_{u ∈ R^p, β ∈ R} (1/2) u^T u
  subject to β + |u^T (x_i' - x_i)| ≤ -1 for all i ∈ I_0,
             β + u^T (x_i' - x_i) y_i ≥ 1 for all i ∈ I_1 ∪ I_{-1}.   (13)

By re-writing the constraints of (13) in terms of µ̂ and ŵ, we recover the same constraints as the max margin comparison LP (8). Thus µ̂, ŵ are feasible for (8).

One may also wonder: are µ̂, ŵ optimal for the max margin comparison LP? In general, the answer is no, and we give one counterexample in the middle panel of Figure 2. This is because the LP defines the margin in terms of ranking function values r(x) = w^T x, but the QP defines the margin in terms of the size of the normal vector u, which depends on the scale of the inputs x, x'. However, when the input variables are scaled in a pre-processing step, we have observed that the solutions to the LP and QP are equivalent (right panel of Figure 2).

Lemma 1 is very useful in practice. It means that a ranking function r can be learned on comparison data by first solving a standard binary SVM, and then transforming the solution. We use this result in the next section to build a general support vector algorithm for comparison data.

3.2 Kernelized QP for non-separable data

In this section, we assume the data are not separable, and we want to learn a nonlinear ranking function. We define a positive definite kernel κ : R^p × R^p → R, which implicitly defines an enlarged set of features Φ(x) (middle panel of Figure 1). As in (10), we learn a function f(x) = β + u^T Φ(x) which is affine in the feature space. Let α, α' ∈ R^m be coefficients such that u = Σ_{i=1}^m α_i Φ(x̃_i) + α_i' Φ(x̃_i'), so that f(x) = β + Σ_{i=1}^m α_i κ(x̃_i, x) + α_i' κ(x̃_i', x). We then use Lemma 1 to define the ranking function

  r(x) = u^T Φ(x)/β = (1/β) Σ_{i=1}^m [α_i κ(x̃_i, x) + α_i' κ(x̃_i', x)].   (14)

Let K = [K_1 ⋯ K_m K_1' ⋯ K_m'] ∈ R^{2m×2m} be the kernel matrix, where for all pairs i ∈ {1, ..., m} the kernel vectors K_i, K_i' ∈ R^{2m} are defined as

  K_i = [κ(x̃_1, x̃_i) ⋯ κ(x̃_m, x̃_i), κ(x̃_1', x̃_i) ⋯ κ(x̃_m', x̃_i)]^T,
  K_i' = [κ(x̃_1, x̃_i') ⋯ κ(x̃_m, x̃_i'), κ(x̃_1', x̃_i') ⋯ κ(x̃_m', x̃_i')]^T.   (15)

Letting a = [α; α'] ∈ R^{2m}, the norm of the linear function in the feature space is u^T u = a^T K a, and we can write the primal soft-margin comparison QP for some C ∈ R_+ as

  min_{a ∈ R^{2m}, ξ ∈ R^m, β ∈ R} (1/2) a^T K a + C Σ_{i=1}^m ξ_i
  subject to ξ_i ≥ 0 and ξ_i ≥ 1 - ỹ_i (β + a^T (K_i' - K_i)) for all i ∈ {1, ..., m}.   (16)

Let λ, v ∈ R^m be the dual variables and let Y = Diag(ỹ) be the diagonal matrix of the m labels. Then the Lagrangian can be written as

  L = (1/2) a^T K a + C ξ^T 1_m - λ^T ξ + v^T (1_m - ỹ β - Y M^T K a - ξ),   (17)

where M = [-I_m; I_m] ∈ {-1, 0, 1}^{2m×m}. Solving ∂L/∂a = 0 results in the following stationary condition:

  a = M Y v.   (18)

The rest of the derivation of the dual comparison problem is the same as for the standard binary SVM. The resulting dual QP is

  min_{v ∈ R^m} (1/2) v^T Y M^T K M Y v - v^T 1_m
  subject to Σ_{i=1}^m v_i ỹ_i = 0 and 0 ≤ v_i ≤ C for all i ∈ {1, ..., m},   (19)

which is equivalent to the dual problem of a standard binary SVM with kernel K̃ = M^T K M ∈ R^{m×m} and labels ỹ ∈ {-1, 1}^m. So we can solve the dual comparison problem (19) using any efficient SVM solver, such as libsvm [7]. We used the R interface in the kernlab package [8], and our code is available in the tdhock/ranksvmcompare package on Github. After obtaining optimal dual variables v ∈ R^m as the solution of (19), the SVM solver also gives us the optimal bias β by analyzing the complementary slackness conditions. The learned ranking function can be quickly evaluated, since the optimal v is sparse.
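As a concrete sketch of this training step with the kernlab package [8], reusing Xtil, Xtilp, ytil, m from the earlier sketch: the Gaussian kernel width and cost C are arbitrary untuned choices, and recovering the bias as beta = -b(fit) follows kernlab's sign convention for its decision function; both assumptions should be checked against the solver at hand. The last lines evaluate the ranking function through the sparse support-vector expansion (20) derived next.

  library(kernlab)
  kern <- rbfdot(sigma = 1)           # arbitrary Gaussian kernel width
  Z <- rbind(Xtil, Xtilp)             # 2m stacked points
  K <- kernelMatrix(kern, Z)          # K in R^{2m x 2m}, built from (15)
  M <- rbind(-diag(m), diag(m))       # M = [-I_m; I_m]
  Ktil <- t(M) %*% K %*% M            # kernel of the equivalent binary SVM
  fit <- ksvm(as.kernelMatrix(Ktil), ytil, type = "C-svc", C = 1)
  sv <- alphaindex(fit)[[1]]          # indices i with v_i > 0
  yv <- coef(fit)[[1]]                # products ytil_i * v_i for i in sv
  beta <- -b(fit)                     # bias, up to kernlab's sign convention
  rank.fun <- function(x) {           # ranking function r(x), cf. (20)
    x1 <- matrix(x, nrow = 1)
    kx  <- kernelMatrix(kern, x1, Xtil[sv, , drop = FALSE])
    kxp <- kernelMatrix(kern, x1, Xtilp[sv, , drop = FALSE])
    sum(yv * (kxp - kx)) / beta
  }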
Let sv = {i : v_i > 0} be the indices of the support vectors. Since we need only |sv| kernel evaluations, the ranking function (14) becomes

  r(x) = Σ_{i ∈ sv} ỹ_i v_i [κ(x̃_i', x) - κ(x̃_i, x)] / β.   (20)

Note that for all i ∈ {1, ..., m}, the optimal primal variables α_i = -ỹ_i v_i and α_i' = ỹ_i v_i are recovered using the stationary condition (18). The learned comparison function remains the same (7). The training procedure is summarized as Algorithm 1. There are two sub-routines: KernelMatrix computes the 2m × 2m kernel matrix, and SVMdual solves the SVM dual QP (19). There are two hyper-parameters to tune: the cost C and the kernel κ. As with the standard SVM for binary classification, these parameters can be tuned by minimizing the prediction error on a held-out validation set.

Algorithm 1 SVMcompare
Input: cost C ∈ R_+, kernel κ : R^p × R^p → R, features X, X' ∈ R^{n×p}, labels y ∈ {-1, 0, 1}^n
  X̃ ← [X_0; X_0'; X_1; X_{-1}']
  X̃' ← [X_0'; X_0; X_1'; X_{-1}]
  ỹ ← [-1_{|I_0|}; -1_{|I_0|}; 1_{|I_1|}; 1_{|I_{-1}|}]
  K ← KernelMatrix(X̃, X̃', κ)
  M ← [-I_m; I_m]
  K̃ ← M^T K M
  v, β ← SVMdual(K̃, ỹ, C)
  sv ← {i : v_i > 0}
Output: support vectors X̃_sv, X̃'_sv, labels ỹ_sv, bias β, dual variables v

4 Results

The goal of our learning algorithm is to accurately predict a test set of labeled pairs (2). We use the zero-one loss for evaluating the learned comparison function on the test set (Table 2).

Table 2: We use the zero-one loss to evaluate a predicted label ŷ given the true label y. False positives (FP) occur when predicting a significant difference ŷ ∈ {-1, 1} when there is none (y = 0), and false negatives (FN) are the opposite. Inversions occur when predicting the opposite of the true label (ŷ = -y ≠ 0).

          ŷ = -1      ŷ = 0    ŷ = 1
  y = -1  correct     FN       Inversion
  y = 0   FP          correct  FP
  y = 1   Inversion   FN       correct

4.1 Simulation: norms in 2d

We used a simulation to test the proposed SVMcompare model against a baseline ranking model that ignores the equality (y_i = 0) pairs, SVMrank [3]. The goal of our simulation is to demonstrate that our model can perform better by learning from the equality (y_i = 0) pairs when there are few inequality (y_i ∈ {-1, 1}) pairs. We generated pairs x_i, x_i' ∈ [-3, 3]^2 and noisy labels y_i = t_1[r(x_i') - r(x_i) + ε_i], where t_1 is the threshold function (4), r is the true ranking function, and ε_i ~ N(0, 1/4) is noise. We picked train, validation, and test sets, each with n/2 equality pairs and n/2 inequality pairs, for several training set sizes n. We fit a grid of models to the training set (the cost parameter C on a log-spaced grid from 10^-3 to 10^3, and the Gaussian kernel width on a similar log-spaced grid), select the model with minimal zero-one loss on the validation set, and then use the test set to estimate the generalization ability of the selected model. For the SVMrank model explained in Section 2, equality pairs are ignored when learning the ranking function, but are used to learn a threshold τ̂ via grid search (6) for when to predict c(x, x') = 0.

In Figure 3 we show one training set, along with the level curves of the ranking functions learned by the SVMrank and SVMcompare models. It is clear that SVMcompare recovers a ranking function that is closer to the true r, especially in the case of the squared l2 norm r(x) = ||x||^2. In Figure 4 the test error of the proposed SVMcompare model is clearly lower than that of the baseline SVMrank model, especially with more labeled pairs n.

5 Conclusions and future work

We discussed an extension of SVM to comparison problems. Our results highlighted the importance of directly modeling the equality pairs (y_i = 0), and it will be interesting to see whether the same results are observed in learning-to-rank data sets. For scaling to very large data sets, it will be interesting to try algorithms based on smooth discriminative loss functions, such as stochastic gradient descent with a logistic loss.

Acknowledgements: TDH was funded by a KAKENHI grant, SS by a MEXT scholarship, and MS by KAKENHI. Thanks to Simon Lacoste-Julien and Hang Li for helpful discussions.
References
[1] P.L. Bartlett and M.H. Wegkamp, "Classification with a reject option using a hinge loss," Journal of Machine Learning Research, vol. 9, pp. 1823-1840, 2008.
[2] H. Li, "A short introduction to learning to rank," IEICE Transactions on Information and Systems, vol. E94-D, no. 10, pp. 1854-1862, 2011.
[3] T. Joachims, "Optimizing search engines using clickthrough data," Proc. KDD, pp. 133-142, 2002.
[4] K. Zhou, G.-R. Xue, H. Zha, and Y. Yu, "Learning to rank with ties," Proc. ACM SIGIR, pp. 275-282, New York, NY, 2008.
[5] R. Herbrich, T. Minka, and T. Graepel, "TrueSkill: A Bayesian skill rating system," Advances in Neural Information Processing Systems, pp. 569-576, 2007.
[6] D. Meyer and K. Hornik, "relations: Data structures and algorithms for relations," R package, 2013.
[7] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[8] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, "kernlab - an S4 package for kernel methods in R," Journal of Statistical Software, vol. 11, no. 9, pp. 1-20, 2004.
Figure 3: Application to simulated patterns x, x' ∈ R^2, where the true ranking functions r are norms of the input (a different norm in each panel, from left to right). Top: the training data are n pairs, half equality (segments indicate two points of equal rank) and half inequality (arrows point to the higher rank). Middle and bottom: level curves of the ranking functions learned by SVMrank and SVMcompare. SVMrank ignores the equality pairs, so in general SVMcompare recovers the pattern better.

Figure 4: Test error of SVMrank and SVMcompare (the Bayes error rate of the true ranking function r is shown for comparison). We plot the mean and standard deviation of the prediction error across 4 randomly chosen data sets, as a function of the number n of labeled pairs in the training set (a vertical line shows the data set size used in Figure 3). It is clear that in general SVMcompare makes better predictions on the test data.