A Simple Algorithm for Multilabel Ranking
Krzysztof Dembczyński¹, Wojciech Kotłowski¹, Eyke Hüllermeier²
¹ Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland
² Knowledge Engineering & Bioinformatics Lab (KEBI), Marburg University, Germany
EURO 2012, Vilnius, Lithuania
Multiclass Classification

Exactly one label is relevant:
politics 0, economy 0, business 0, sport 0, tennis 1, soccer 0, show-business 0, celebrities 0, …, England 0, USA 0, Poland 0, Lithuania 0
Multilabel Classification

Several labels can be relevant at the same time:
politics 0, economy 0, business 0, sport 1, tennis 1, soccer 0, show-business 0, celebrities 1, …, England 1, USA 1, Poland 1, Lithuania 0
Multilabel Ranking

Labels are ordered from the most to the least relevant:
tennis ≻ sport ≻ England ≻ Poland ≻ USA ≻ … ≻ politics
Multilabel Ranking

The goal is to learn a function $h(x) = (h_1(x), \dots, h_m(x))$ that, for a given vector $x \in X$, ranks the binary labels $y = (y_1, \dots, y_m)$ from the most to the least relevant.

        X_1    X_2    Y_1   Y_2   …   Y_m
x_1     5.0    4.5     1     1    …    0
x_2     2.0    2.5     0     1    …    0
…        …      …      …     …    …    …
x_n     3.0    3.5     0     1    …    1
x       4.0    2.5     ?     ?    …    ?

For the test vector $x$, the prediction $h_2 > h_1 > \dots > h_m$ corresponds to the ranking $y_2 \succ y_1 \succ \dots \succ y_m$.
Multilabel Ranking

We use rank loss to measure the quality of a ranking:
$$\ell_{rnk}(y, h) = w(y) \sum_{(i,j)\colon y_i > y_j} \Big( [\![ h_i(x) < h_j(x) ]\!] + \tfrac{1}{2} [\![ h_i(x) = h_j(x) ]\!] \Big),$$
where $w(y) < w_{\max}$ is a weight function.

Example: for $x = (4.0, 2.5)$ with $y = (1, 0, \dots, 0)$, the prediction $h_2 > h_1 > \dots > h_m$ incurs a loss, since the irrelevant label 2 is ranked above the relevant label 1.

The weight function $w(y)$ is usually used to normalize the range of the rank loss to $[0, 1]$:
$$w(y) = \big(s_y (m - s_y)\big)^{-1}, \quad \text{where } s_y = \sum_i y_i,$$
i.e., it is equal to the inverse of the total number of pairwise comparisons between labels.
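To make the definition concrete, here is a minimal sketch (our own illustration, not the authors' code) of the weighted rank loss with the normalizing weight $w(y) = (s_y(m - s_y))^{-1}$; the function name and the NumPy-based implementation are assumptions for the example.

```python
import numpy as np

def rank_loss(y, h):
    """Weighted rank loss: counts relevant/irrelevant label pairs that are
    mis-ordered by the scores h, ties counted as 1/2, normalized by the
    number of such pairs via w(y) = 1 / (s_y * (m - s_y))."""
    y, h = np.asarray(y), np.asarray(h)
    m, s = len(y), y.sum()
    if s == 0 or s == m:          # no relevant/irrelevant pairs to compare
        return 0.0
    w = 1.0 / (s * (m - s))       # normalizing weight w(y)
    loss = 0.0
    for i in np.flatnonzero(y == 1):
        for j in np.flatnonzero(y == 0):
            loss += (h[i] < h[j]) + 0.5 * (h[i] == h[j])
    return w * loss

# Example: y = (1, 0, 0), scores rank label 2 first -> one mis-ordered pair.
print(rank_loss([1, 0, 0], [0.3, 0.9, 0.1]))  # 0.5
```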
Risk and Bayes Classifier

We would like to minimize the expected rank loss, or risk, of $h(x)$:
$$L_{rnk}(h, P) = \mathbb{E}\,[\ell_{rnk}(Y, h(X))] = \int \ell_{rnk}(y, h(x))\, dP(x, y).$$

The optimal solution can be determined pointwise, for each $x \in X$ separately:
$$L_{rnk}(h, P \mid x) = \mathbb{E}\,[\ell_{rnk}(Y, h(x)) \mid x] = \sum_{y \in \mathcal{Y}} \ell_{rnk}(y, h(x))\, P(y \mid x).$$

The optimal classifier $h^*$, referred to as the Bayes classifier, is then given by:
$$h^*(x) = \operatorname*{arg\,min}_{s \in \mathbb{R}^m} \sum_{y \in \mathcal{Y}} \ell_{rnk}(y, s)\, P(y \mid x).$$

It is often more reasonable to compare a given $h$ to the Bayes classifier $h^*$ by means of the regret, defined by:
$$\mathrm{Reg}_{\ell}(h, P) = L_{\ell}(h, P) - L_{\ell}(h^*, P).$$
Regret and Consistency

Since the rank loss is neither convex nor differentiable, we want to use a surrogate loss that is easier to optimize and leads to similar results.

We say that a surrogate loss $\ell$ is consistent with the rank loss when the following holds:
$$\mathrm{Reg}_{\ell}(h, P) \to 0 \implies \mathrm{Reg}_{rnk}(h, P) \to 0.$$
Pairwise Surrogate Losses

The most intuitive approach is to use pairwise convex surrogate losses of the form
$$\ell_{\phi}(y, h) = w(y) \sum_{(i,j)\colon y_i > y_j} \phi(h_i - h_j),$$
where $\phi$ is
- the exponential function (BoosTexter)¹: $\phi(f) = e^{-f}$,
- the logistic function (LLLR)²: $\phi(f) = \log(1 + e^{-f})$, or
- the hinge function (RankSVM)³: $\phi(f) = \max(0, 1 - f)$.

¹ R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
² O. Dekel, Ch. Manning, and Y. Singer. Log-linear models for label ranking. In NIPS, 2003.
³ A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, pages 681–687, 2001.
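For reference, the three surrogates can be written down directly as functions of the score difference $f = h_i - h_j$; a small sketch (function names are our own):

```python
import numpy as np

def phi_exp(f):    # exponential surrogate (BoosTexter)
    return np.exp(-f)

def phi_log(f):    # logistic surrogate (LLLR)
    return np.log1p(np.exp(-f))

def phi_hinge(f):  # hinge surrogate (RankSVM)
    return np.maximum(0.0, 1.0 - f)
```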
Surrogate Losses

Figure: The surrogate losses $\phi(f)$ (exponential, logistic, hinge) compared with the Boolean test (0/1 loss) as functions of $f$.
Pairwise Surrogate Losses

Let us denote
$$\Delta^{uv}_{ij} = \sum_{y\colon y_i = u,\, y_j = v} w(y)\, P(y \mid x).$$

- $\Delta^{uv}_{ij}$ reduces to $P(Y_i = u, Y_j = v \mid x)$ for $w(y) \equiv 1$.
- $\Delta^{uv}_{ij} = \Delta^{vu}_{ji}$ for all $(i, j)$.
- Let $W = \mathbb{E}[w(Y) \mid x] = \sum_y w(y) P(y \mid x)$. Then $\Delta^{00}_{ij} + \Delta^{01}_{ij} + \Delta^{10}_{ij} + \Delta^{11}_{ij} = W$.

Example distribution ($m = 3$, $w(y) \equiv 1$):

P(y)   w   Y_1  Y_2  Y_3
0.0    1    0    0    0
0.0    1    0    0    1
0.2    1    0    1    0
0.2    1    0    1    1
0.4    1    1    0    0
0.1    1    1    0    1
0.1    1    1    1    0
0.0    1    1    1    1

$\Delta^{10}_{12} = 0.5$, $\Delta^{01}_{12} = 0.4$, $\Delta^{10}_{13} = 0.5$, $\Delta^{01}_{13} = 0.2$, $\Delta^{10}_{23} = 0.3$, $\Delta^{01}_{23} = 0.1$.
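The $\Delta^{uv}_{ij}$ values on the slide can be reproduced from the toy distribution; a sketch assuming $w(y) \equiv 1$ (the names `P` and `delta` are our own):

```python
import itertools

# Toy conditional distribution P(y | x) over m = 3 labels, from the slide.
P = {(0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
     (1, 0, 0): 0.4, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.0}

def delta(i, j, u, v, w=lambda y: 1.0):
    """Delta^{uv}_{ij} = sum over y with y_i = u, y_j = v of w(y) P(y | x)."""
    return sum(w(y) * p for y, p in P.items() if y[i] == u and y[j] == v)

for i, j in itertools.combinations(range(3), 2):
    print(f"Delta10_{i+1}{j+1} = {delta(i, j, 1, 0):.1f}, "
          f"Delta01_{i+1}{j+1} = {delta(i, j, 0, 1):.1f}")
# Delta10_12 = 0.5, Delta01_12 = 0.4
# Delta10_13 = 0.5, Delta01_13 = 0.2
# Delta10_23 = 0.3, Delta01_23 = 0.1
```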
Pairwise Surrogate Losses

The conditional risk can be written as:
$$L_{rnk}(h, P \mid x) = \sum_{i > j} \Big( \Delta^{10}_{ij} [\![ h_i < h_j ]\!] + \Delta^{01}_{ij} [\![ h_i > h_j ]\!] + \tfrac{1}{2}\big(\Delta^{10}_{ij} + \Delta^{01}_{ij}\big) [\![ h_i = h_j ]\!] \Big).$$

Its minimum (the Bayes risk) is
$$L^*_{rnk}(P \mid x) = \sum_{i > j} \min\{\Delta^{10}_{ij}, \Delta^{01}_{ij}\},$$
while the conditional risk of a pairwise surrogate loss is
$$L_{\phi}(h, P \mid x) = \sum_{i > j} \Delta^{10}_{ij}\, \phi(h_i - h_j) + \Delta^{10}_{ji}\, \phi(h_j - h_i),$$
and a necessary condition for consistency is that the Bayes classifier $h^*$ for the $\phi$-loss is also the Bayes ranker, i.e.,
$$\mathrm{sign}(h^*_i - h^*_j) = \mathrm{sign}(\Delta^{10}_{ij} - \Delta^{01}_{ij}).$$
Multilabel Ranking

This approach is, however, inconsistent for the most commonly used convex surrogates (Duchi et al. 2010⁴, Gao and Zhou 2011⁵).

- The (nonlinear monotone) transformation $\phi$ applies to the differences $h_i - h_j$, so minimizing the pairwise convex losses results in a complicated solution $h^*$, where $h^*_i$ generally depends on all $\Delta^{10}_{jk}$ ($1 \le j, k \le m$).
- The only case in which the above convex pairwise loss is consistent is when the labels are independent (the case of bipartite ranking).
- There exists a class of consistent pairwise surrogates, but we will present a different, simpler approach that is also consistent.

⁴ J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010.
⁵ W. Gao and Z. Zhou. On the consistency of multi-label learning. In COLT, pages 341–358, 2011.
Reduction to Weighted Binary Relevance

The Bayes ranker can be obtained by sorting the labels according to:
$$\Delta^{1}_{i} = \sum_{y\colon y_i = 1} w(y)\, P(y \mid x).$$

For $w(y) \equiv 1$, the labels should be sorted according to their marginal probabilities, since $\Delta^{u}_{i}$ reduces to $P(Y_i = u \mid x)$ in this case (Dembczyński et al. 2010)⁶.

⁶ K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279–286, 2010.
Reduction to Weighted Binary Relevance

Recall that the minimum (the Bayes risk) is
$$L^*_{rnk}(P \mid x) = \sum_{i > j} \min\{\Delta^{10}_{ij}, \Delta^{01}_{ij}\}.$$

Since $\Delta^{1}_{i} = \Delta^{10}_{ij} + \Delta^{11}_{ij}$ (and, likewise, $\Delta^{1}_{j} = \Delta^{01}_{ij} + \Delta^{11}_{ij}$), we can prove that
$$\Delta^{1}_{i} - \Delta^{1}_{j} = \Delta^{10}_{ij} - \Delta^{01}_{ij}.$$

For the example distribution above: $\Delta^{1}_{1} = 0.6$, $\Delta^{1}_{2} = 0.5$, $\Delta^{1}_{3} = 0.3$, consistent with $\Delta^{10}_{12} = 0.5$, $\Delta^{01}_{12} = 0.4$, $\Delta^{10}_{13} = 0.5$, $\Delta^{01}_{13} = 0.2$, $\Delta^{10}_{23} = 0.3$, $\Delta^{01}_{23} = 0.1$.
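Continuing the sketch above (reusing `P`, `delta`, and `itertools`), the identity can be checked numerically; `delta1` is our own helper:

```python
def delta1(i, w=lambda y: 1.0):
    """Delta^1_i = sum over y with y_i = 1 of w(y) P(y | x)."""
    return sum(w(y) * p for y, p in P.items() if y[i] == 1)

d1 = [delta1(i) for i in range(3)]
print(d1)  # [0.6, 0.5, 0.3] -> sorting by Delta^1 gives the Bayes ranking

# Check Delta^1_i - Delta^1_j = Delta^{10}_{ij} - Delta^{01}_{ij} for all pairs.
for i, j in itertools.combinations(range(3), 2):
    assert abs((d1[i] - d1[j])
               - (delta(i, j, 1, 0) - delta(i, j, 0, 1))) < 1e-12
```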
Reduction to Weighted Binary Relevance

Consider the univariate (weighted) exponential and logistic losses:
$$\ell_{\exp}(y, h) = w(y) \sum_{i=1}^{m} e^{-(2 y_i - 1) h_i}, \qquad \ell_{\log}(y, h) = w(y) \sum_{i=1}^{m} \log\big(1 + e^{-(2 y_i - 1) h_i}\big).$$

The risk minimizer of these losses is
$$h^*_i(x) = \frac{1}{c} \log \frac{\Delta^{1}_{i}}{\Delta^{0}_{i}} = \frac{1}{c} \log \frac{\Delta^{1}_{i}}{W - \Delta^{1}_{i}},$$
which is a strictly increasing transformation of $\Delta^{1}_{i}$, where $W = \mathbb{E}[w(Y) \mid x] = \sum_y w(y) P(y \mid x)$ ($c = 2$ for the exponential loss, $c = 1$ for the logistic loss).
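A quick numeric check of the risk minimizer for the logistic case ($c = 1$), reusing `d1` from the previous sketch; since $w(y) \equiv 1$ in the toy example, $W = 1$:

```python
import math

# h*_i = log(Delta^1_i / (W - Delta^1_i)) is strictly increasing in Delta^1_i,
# so sorting by h* reproduces the ranking by Delta^1.
W = 1.0
h_star = [math.log(d / (W - d)) for d in d1]
print(h_star)  # ~[0.405, 0.0, -0.847] -> label 1 > label 2 > label 3
```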
Main Result⁷

Theorem. Let $\mathrm{Reg}_{rnk}(h, P)$ be the regret for the rank loss, and $\mathrm{Reg}_{\exp}(h, P)$ and $\mathrm{Reg}_{\log}(h, P)$ be the regrets for the exponential and logistic losses, respectively. Then
$$\sqrt{6}\, \mathrm{Reg}_{rnk}(h, P) \le 4 \sqrt{C\, \mathrm{Reg}_{\exp}(h, P)},$$
$$\sqrt{2}\, \mathrm{Reg}_{rnk}(h, P) \le 2 \sqrt{C\, \mathrm{Reg}_{\log}(h, P)},$$
where $C \le m \sqrt{m}\, w_{\max}$.

⁷ K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012.
Reduction to Weighted Binary Relevance

- Vertical reduction: solving $m$ independent binary classification problems.
- Many algorithms that minimize the (weighted) exponential or logistic surrogate, such as AdaBoost or logistic regression, can be applied; see the sketch below.
- Besides its simplicity and efficiency, this approach is consistent.
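As an illustration of how simple the reduction is, here is a minimal weighted binary relevance sketch built on scikit-learn's LogisticRegression (an assumed choice of base learner; the helper names are our own, and degenerate cases such as constant label columns would need extra care in practice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wbr_fit(X, Y):
    """Fit one logistic model per label, weighting each example by w(y)."""
    n, m = Y.shape
    s = Y.sum(axis=1)
    # w(y) = 1 / (s_y (m - s_y)); examples with no label pairs get weight 0
    w = np.where((s > 0) & (s < m), 1.0 / np.maximum(s * (m - s), 1), 0.0)
    return [LogisticRegression().fit(X, Y[:, i], sample_weight=w)
            for i in range(m)]

def wbr_rank(models, x):
    """Rank labels from most to least relevant by the scores h_i(x)."""
    scores = np.array([clf.decision_function(x.reshape(1, -1))[0]
                       for clf in models])
    return np.argsort(-scores)
```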
Empirical Results

- We use the rank loss with weights defined as $w(y) = (s_y (m - s_y))^{-1}$, where $s_y = \sum_i y_i$, i.e., the inverse of the total number of pairwise comparisons between labels.
- We compare algorithms in terms of surrogate losses:
  - the exponential loss: AdaBoost.MR vs. WBR-AdaBoost,
  - the logistic loss: LLLR vs. WBR Logistic Regression.
- Synthetic and benchmark data sets.
Logistic Loss

Figure: Rank loss of WBR LR and LLLR vs. the number of learning examples (250 to 16000), compared to the Bayes risk. Left: independent data. Right: dependent data.

- In the case of label independence, the methods perform more or less on par.
- In the case where labels are dependent, the univariate approach shows small but consistent improvements.
Exponential Loss

Figure: Rank loss of WBR AdaBoost and AdaBoost.MR vs. the number of learning examples (250 to 16000), compared to the Bayes risk. Left: independent data. Right: dependent data.

- AdaBoost.MR behaves strangely: with more than 20 stumps it quickly overfits.
- In both cases, the univariate approach outperforms the pairwise approach.
Benchmark Data

Table: Rank loss of exponential loss-based (left) and logistic loss-based (right) algorithms. For each data set, the winner out of the two competing algorithms is marked by a *.

dataset      AB.MR     WBR-AB    LLLR      WBR-LR
image        0.2081    *0.2041   *0.2047   0.2065
emotions     0.1703    *0.1699   0.1743    *0.1657
scene        *0.0720   0.0792    0.0861    *0.0793
yeast        0.2072    *0.1820   *0.1728   0.1736
mediamill    0.0665    *0.0609   0.0614    *0.0472

The simple reduction algorithms, trained independently on each label, are at least competitive with state-of-the-art algorithms based on pairwise surrogates.
Conclusions

- We have shown that common univariate convex surrogates are consistent for multilabel ranking.
- We proved explicit regret bounds, relating the ranking regret to the univariate loss regret.
- The results are arguably surprising in light of previous ones, where inconsistency is shown for the most popular pairwise surrogates.
- On the more practical side, our results motivate simple and scalable algorithms for multilabel ranking, which are plain modifications of standard algorithms for classification.

This project is partially supported by the Foundation for Polish Science under the Homing Plus programme, co-financed by the European Regional Development Fund.