Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach
Krzysztof Dembczyński and Wojciech Kotłowski
Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland
PAN Summer School, 29.06–04.07.2014
Agenda
1 Binary Classification
2 Bipartite Ranking
3 Multi-Label Classification
4 Reductions in Multi-Label Classification
5 Conditional Ranking

The project is co-financed by the European Union from resources of the European Social Fund.
Outline
1 Ranking problem
2 Multilabel ranking
3 Summary
Ranking problem
Ranking problem from the learning perspective: train a model that sorts items according to the preferences of a subject.
Problems vary in the preference structure and training information:
- Bipartite, multipartite and object ranking,
- Ordinal classification/regression,
- Multi-label ranking,
- Conditional ranking.
Object ranking
Ranking of national football teams.
Multi-label ranking
Sort document tags by relevance: tennis, sport, Wimbledon, Poland, USA, politics.
Label ranking
Training data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is a ranking (permutation) of a fixed number of labels/alternatives.1
Predict the permutation (y_π(1), y_π(2), ..., y_π(m)) for a given x.

       X1   X2 | Y1  Y2  ...  Ym
  x1  5.0  4.5 |  1   3        2
  x2  2.0  2.5 |  2   1        3
  ...
  xn  3.0  3.5 |  3   1        2
  x   4.0  2.5 |  ?   ?        ?

1 E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897–1916, 2008
Label ranking
Training data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is a ranking (permutation) of a fixed number of labels/alternatives.1
Predict the permutation (y_π(1), y_π(2), ..., y_π(m)) for a given x.

       X1   X2 | Y1  Y2  ...  Ym
  x1  5.0  4.5 |  1   3        2
  x2  2.0  2.5 |  2   1        3
  ...
  xn  3.0  3.5 |  3   1        2
  x   4.0  2.5 |  1   2        3

1 E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897–1916, 2008
Collaborative filtering2
Training data: {(u_i, m_j, y_ij)}, for some i = 1, ..., n and j = 1, ..., m, y_ij ∈ Y = R.
Predict y_ij for a given u_i and m_j.

       m1  m2  m3  ...  mm
  u1    1            4
  u2        3        1
  u3    2        5
  ...
  un    2   1

2 D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992
Dyadic prediction3
[Figure: an instances × labels matrix with side information for both rows (x_1, ..., x_n) and columns (y_1, ..., y_m); some entries are observed, many are missing (?), and predictions are needed both for new instances x_{n+1}, x_{n+2} and for new labels y_{m+1}, y_{m+2}.]

3 A.K. Menon and C. Elkan. Predicting labels for dyadic data. Data Mining and Knowledge Discovery, 21(2), 2010
Query-document models. Conditional ranking
Feedback information
Different types of feedback information:
utility scores: x1: 0.19, x2: 0.93, x3: 0.71, x4: 0.52, x5: 0.09
Feedback information
Different types of feedback information:
total order: x2 ≻ x3 ≻ x4 ≻ x1 ≻ x5
Feedback information
Different types of feedback information:
partial order: x2 ≻ {x3, x4} ≻ {x1, x5} (x3 and x4, as well as x1 and x5, are incomparable)
Feedback information
Different types of feedback information:
pairwise comparisons: x2 ≻ x3, x2 ≻ x4, x2 ≻ x1, x2 ≻ x5, x3 ≻ x1, x3 ≻ x5, x4 ≻ x1, x4 ≻ x5.
Feedback information
Different types of feedback information:
ordinal labels: x1: 1, x2: 5, x3: 4, x4: 3, x5: 1, which imply
x2 ≻ x3, x2 ≻ x4, x2 ≻ x1, x2 ≻ x5, x3 ≻ x1, x3 ≻ x4, x3 ≻ x5, x4 ≻ x1, x4 ≻ x5.
Feedback information
Different types of feedback information:
binary labels: x1: 0, x2: 1, x3: 1, x4: 1, x5: 0, which imply
x2 ≻ x1, x3 ≻ x1, x4 ≻ x1, x2 ≻ x5, x3 ≻ x5, x4 ≻ x5.
Task losses
Performance measures (task losses) in ranking problems:
- Pairwise disagreement (also referred to as rank loss),
- Discounted cumulative gain,
- Average precision,
- Expected reciprocal rank,
- ...
These measures are usually neither convex nor differentiable, which makes them hard to optimize. Learning algorithms therefore employ surrogate losses to facilitate the optimization problem.
Can we design, for a given ranking problem, a surrogate loss that provides a near-optimal solution with respect to a given task loss?
Pairwise disagreement
Let r < r′ be the true ranks of two objects and r̂, r̂′ the predicted ranks of the same objects. Pairwise disagreement can be expressed by counting errors of the type

  r̂ > r̂′,

i.e., pairs whose true order is reversed in the prediction. In general, the problem cannot be easily solved, but for some special cases it is possible.4

4 J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010; W. Kotłowski, K. Dembczyński, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In International Conference on Machine Learning, pages 1113–1120, 2011
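The counting rule above can be sketched as follows. This is a minimal illustration, not code from the talk; the half-penalty for tied predictions is an assumption borrowed from the rank-loss definition used later in the lecture.

```python
from itertools import combinations

def pairwise_disagreement(r, r_hat):
    """Count pairs whose true order (ranks r, smaller = better) is
    reversed by the predicted ranks r_hat; ties get half an error."""
    errors = 0.0
    for i, j in combinations(range(len(r)), 2):
        if r[i] == r[j]:
            continue  # no true preference between i and j
        if r[i] > r[j]:
            i, j = j, i  # orient the pair so that i is truly ranked better
        if r_hat[i] > r_hat[j]:
            errors += 1.0  # true order reversed
        elif r_hat[i] == r_hat[j]:
            errors += 0.5  # tie counts half
    return errors
```

For three objects, a fully reversed prediction makes all three pairs disagree.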
Discounted cumulative gain
Let us assume that there are n objects to rank. Let r, r̂ ∈ {1, ..., n} represent the true and predicted rank of an object, respectively. Discounted cumulative gain can be expressed by

  DCG = Σ_{i=1}^{n} (2^{n − r_i} − 1) / log(1 + r̂_i)

Example:
  r̂: 1 2 3 4 5 6 7 8 9
  r:  1 5 3 4 2 6 7 8 9

Reduction to regression5 or multi-class classification6 is possible.

5 D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Trans. Info. Theory, 54:5140–5154, 2008
6 Ping Li, Christopher J. C. Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007
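The DCG formula above can be sketched directly. The natural logarithm and the gain 2^{n − r_i} − 1 (decreasing in the true rank) follow the reconstruction on this slide; other DCG variants use log base 2 and graded relevance instead.

```python
import math

def dcg(r, r_hat):
    """DCG with gain 2^(n - r_i) - 1 from the true rank r_i and a
    logarithmic discount of the predicted rank r_hat_i."""
    n = len(r)
    return sum((2 ** (n - ri) - 1) / math.log(1 + rhi)
               for ri, rhi in zip(r, r_hat))
```

Ranking the objects in their true order always scores at least as high as any other ordering.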
Average precision
Let us assume that there are n objects to rank. Let r ∈ {0, 1} be the relevance and r̂ ∈ {1, ..., n} the predicted rank of the same object. Average precision can be expressed by:

  AP = (1 / |{i : r_i = 1}|) Σ_{i: r_i = 1} ( Σ_{k: r̂_k ≤ r̂_i} r_k ) / r̂_i

Example:
  r̂: 1 2 3 4 5 6 7 8 9
  r:  1 1 0 0 1 0 1 0 0

Theoretical analysis in terms of surrogate losses.7

7 Clément Calauzènes, Nicolas Usunier, and Patrick Gallinari. Calibration and regret bounds for order-preserving surrogate losses in learning to rank. Machine Learning, 93(2–3):227–260, 2013
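The formula can be sketched as the mean, over relevant items, of the precision at each item's predicted position:

```python
def average_precision(r, r_hat):
    """AP: mean over relevant items (r_i = 1) of the precision at
    their predicted rank r_hat_i."""
    relevant = [i for i in range(len(r)) if r[i] == 1]
    total = 0.0
    for i in relevant:
        # relevant items predicted at or above item i's position
        hits = sum(r[k] for k in range(len(r)) if r_hat[k] <= r_hat[i])
        total += hits / r_hat[i]
    return total / len(relevant)
```

On the slide's example (relevant items at predicted positions 1, 2, 5, 7), AP = (1/1 + 2/2 + 3/5 + 4/7) / 4.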
Expected reciprocal rank
Let us assume that there are n objects to rank. Let r, r̂ ∈ {1, ..., n} represent the true and predicted rank of an object, respectively. Expected reciprocal rank8 can be expressed by:

  ERR = Σ_{k=1}^{n} (1/k) P(user stops at position k)
      = Σ_{k=1}^{n} (1/k) R_k Π_{q=1}^{k−1} (1 − R_q),   R_k = (2^{n − r_k} − 1) / 2^{n − 1}.

Theoretical analysis in terms of surrogate losses.9

8 O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. J. of Mach. Learn. Res., 14:1–24, 2011
9 Clément Calauzènes, Nicolas Usunier, and Patrick Gallinari. Calibration and regret bounds for order-preserving surrogate losses in learning to rank. Machine Learning, 93(2–3):227–260, 2013
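The cascade model above can be sketched as follows; the satisfaction probability R follows the reconstruction on this slide (standard ERR uses graded relevance in place of the true rank).

```python
def err(r, r_hat):
    """Expected reciprocal rank: the user scans positions in predicted
    order and stops at position k with prob R_k * prod_{q<k} (1 - R_q)."""
    n = len(r)
    # satisfaction probability derived from the true rank
    R = [(2 ** (n - ri) - 1) / 2 ** (n - 1) for ri in r]
    order = sorted(range(n), key=lambda i: r_hat[i])  # items in predicted order
    total, p_continue = 0.0, 1.0
    for pos, i in enumerate(order, start=1):
        total += p_continue * R[i] / pos
        p_continue *= 1.0 - R[i]
    return total
```

With two objects, placing the truly best one first gives ERR = 0.5, placing it second gives 0.25.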
Setting
Objects (x, y) are generated from an unknown distribution P(x, y). The risk (expected loss) of a function h(x) is

  L_l(h) := E_{(x,y)}[l(y, h(x))],

where l is a loss function. The regret of a classifier is

  Reg_l(h) = L_l(h) − L_l(h*),

where h* = arg min_h L_l(h) is the Bayes classifier.
Setting
Since task losses are usually neither convex nor differentiable, we use surrogate (or proxy) losses that are easier to optimize. We say that a surrogate loss l̃ is consistent (calibrated) with the task loss l when the following holds:

  Reg_l̃(h) → 0  ⟹  Reg_l(h) → 0.
Outline
1 Ranking problem
2 Multilabel ranking
3 Summary
Multilabel ranking
Training data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, y_i ∈ {0, 1}^m.
Sort labels from the most to the least relevant for a given x.

       X1   X2 | Y1  Y2  ...  Ym
  x1  5.0  4.5 |  1   1        0
  x2  2.0  2.5 |  0   1        0
  ...
  xn  3.0  3.5 |  0   1        1
  x   4.0  2.5 |  ?   ?        ?
Multilabel ranking
Training data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, y_i ∈ {0, 1}^m.
Sort labels from the most to the least relevant for a given x. The prediction for x takes the form of label scores, e.g.:

  h_2 > h_1 > ... > h_m
Multilabel ranking
Training data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, y_i ∈ {0, 1}^m.
Sort labels from the most to the least relevant for a given x. The scores induce a ranking of the labels:

  y_2 ≻ y_1 ≻ ... ≻ y_m
Multilabel ranking
Rank loss:

  l(y, h(x)) = w(y) Σ_{(i,j): y_i > y_j} ( ⟦h_i(x) < h_j(x)⟧ + ½ ⟦h_i(x) = h_j(x)⟧ ),

where w(y) < w_max is a weight function.

Example:
       X1   X2 | Y1  Y2  ...  Ym
  x   4.0  2.5 |  1   0        0

  predicted order: h_2 > h_1 > ... > h_m
Multilabel ranking
Rank loss:

  l(y, h(x)) = w(y) Σ_{(i,j): y_i > y_j} ( ⟦h_i(x) < h_j(x)⟧ + ½ ⟦h_i(x) = h_j(x)⟧ ),

where w(y) < w_max is a weight function. The weight function w(y) is usually used to normalize the range of the rank loss to [0, 1]:

  w(y) = 1 / (n_+ · n_−),

i.e., it is equal to the inverse of the total number of pairwise comparisons between relevant (n_+) and irrelevant (n_−) labels.
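The normalized rank loss above can be sketched as:

```python
def rank_loss(y, h):
    """Multi-label rank loss with the normalizing weight
    w(y) = 1 / (n_plus * n_minus); ties between scores count half."""
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [j for j, yj in enumerate(y) if yj == 0]
    if not pos or not neg:
        return 0.0  # no relevant/irrelevant pair to compare
    errors = sum((h[i] < h[j]) + 0.5 * (h[i] == h[j])
                 for i in pos for j in neg)
    return errors / (len(pos) * len(neg))
```

A relevant label scored below an irrelevant one contributes a full error, a tie half an error, both divided by the number of mixed pairs.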
Pairwise surrogate losses
The most intuitive approach is to use pairwise convex surrogate losses of the form

  l_φ(y, h) = w(y) Σ_{(i,j): y_i > y_j} φ(h_i − h_j),

where φ is
- the exponential function (BoosTexter)10: φ(f) = e^{−f},
- the logistic function (LLLR)11: φ(f) = log(1 + e^{−f}), or
- the hinge function (RankSVM)12: φ(f) = max(0, 1 − f).

10 R. E. Schapire and Y. Singer. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2/3):135–168, 2000
11 O. Dekel, Ch. Manning, and Y. Singer. Log-linear models for label ranking. In NIPS. MIT Press, 2004
12 A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, pages 681–687, 2001
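The three surrogates can be sketched together with the pairwise loss; the weight w(y) = 1/(n_+ n_−) reuses the normalization from the previous slide.

```python
import math

def exponential(f):
    return math.exp(-f)             # BoosTexter

def logistic(f):
    return math.log1p(math.exp(-f))  # LLLR

def hinge(f):
    return max(0.0, 1.0 - f)         # RankSVM

def pairwise_surrogate(y, h, phi):
    """l_phi(y, h) = w(y) * sum over (i, j) with y_i > y_j of phi(h_i - h_j)."""
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [j for j, yj in enumerate(y) if yj == 0]
    if not pos or not neg:
        return 0.0
    w = 1.0 / (len(pos) * len(neg))
    return w * sum(phi(h[i] - h[j]) for i in pos for j in neg)
```

With a margin of f = 1 the hinge loss already vanishes, while the exponential and logistic losses stay strictly positive, as the figure on the next slide illustrates.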
Surrogate losses
[Figure: the losses φ(f) as functions of the margin f ∈ [−3, 3]: the 0/1 (Boolean test) loss together with its exponential, logistic, and hinge upper bounds.]
Multilabel ranking
The pairwise approach is, unfortunately, inconsistent for the most commonly used convex surrogates.13 There exists a class of pairwise surrogates that is consistent. We will show, however, that simple univariate (pointwise) variants of the exponential and logistic loss are consistent with the multi-label rank loss.

13 J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010; W. Gao and Z. Zhou. On the consistency of multi-label learning. In COLT, pages 341–358, 2011
Multilabel ranking
Let us denote:

  Δ^{uv}_{ij} = Σ_{y: y_i = u, y_j = v} w(y) P(y | x).

Δ^{uv}_{ij} reduces to P(Y_i = u, Y_j = v | x) for w(y) ≡ 1, and Δ^{uv}_{ij} = Δ^{vu}_{ji} for all (i, j). Let W = E[w(Y) | x] = Σ_y w(y) P(y | x). Then

  Δ^{00}_{ij} + Δ^{01}_{ij} + Δ^{10}_{ij} + Δ^{11}_{ij} = W.

Example (w(y) ≡ 1):

  P(y)  w  Y1 Y2 Y3
  0.0   1   0  0  0
  0.0   1   0  0  1
  0.2   1   0  1  0
  0.2   1   0  1  1
  0.4   1   1  0  0
  0.1   1   1  0  1
  0.1   1   1  1  0
  0.0   1   1  1  1

  Δ^{10}_{12} = 0.5,  Δ^{01}_{12} = 0.4,
  Δ^{10}_{13} = 0.5,  Δ^{01}_{13} = 0.2,
  Δ^{10}_{23} = 0.3,  Δ^{01}_{23} = 0.1
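The Δ quantities for the example distribution can be recomputed directly from the definition (labels are 0-indexed in the code, so Δ^{10}_{12} becomes delta(0, 1, 1, 0, P)):

```python
# the example conditional distribution P(y | x) from the slide, w(y) ≡ 1
P = {
    (0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
    (1, 0, 0): 0.4, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.0,
}

def delta(i, j, u, v, dist, w=lambda y: 1.0):
    """Delta^{uv}_{ij}: total weighted probability of y with y_i = u, y_j = v."""
    return sum(w(y) * p for y, p in dist.items() if y[i] == u and y[j] == v)

def expected_weight(dist, w=lambda y: 1.0):
    """W = E[w(Y) | x]."""
    return sum(w(y) * p for y, p in dist.items())
```

The four Δ values of any pair sum to W, here 1 since w ≡ 1.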
Multilabel ranking
The conditional risk can be written as:

  L_rnk(h | x) = Σ_{i>j} ( Δ^{10}_{ij} ⟦h_i < h_j⟧ + Δ^{01}_{ij} ⟦h_i > h_j⟧ + ½ (Δ^{10}_{ij} + Δ^{01}_{ij}) ⟦h_i = h_j⟧ ).

Ideally, we would like to find h for which:

  L_rnk(h | x) = Σ_{i>j} min{Δ^{10}_{ij}, Δ^{01}_{ij}}.
Reduction to weighted binary relevance
The Bayes ranker can be obtained by sorting labels according to:14

  Δ^1_i = Σ_{y: y_i = 1} w(y) P(y | x).

For w(y) ≡ 1, the labels should be sorted according to their marginal probabilities, since Δ^u_i reduces to P(y_i = u | x) in this case.

14 K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In International Conference on Machine Learning, 2012
Reduction to weighted binary relevance
The Bayes risk is indeed:

  L_rnk(h* | x) = Σ_{i>j} min{Δ^{10}_{ij}, Δ^{01}_{ij}}.

Since Δ^1_i = Δ^{10}_{ij} + Δ^{11}_{ij}, we have:

  Δ^1_i − Δ^1_j = Δ^{10}_{ij} + Δ^{11}_{ij} − Δ^{01}_{ij} − Δ^{11}_{ij} = Δ^{10}_{ij} − Δ^{01}_{ij}.

Example (w(y) ≡ 1, same distribution as before):

  P(y)  w  Y1 Y2 Y3
  0.0   1   0  0  0
  0.0   1   0  0  1
  0.2   1   0  1  0
  0.2   1   0  1  1
  0.4   1   1  0  0
  0.1   1   1  0  1
  0.1   1   1  1  0
  0.0   1   1  1  1

  Δ^1_1 = 0.6,  Δ^1_2 = 0.5,  Δ^1_3 = 0.3
  Δ^{10}_{12} = 0.5,  Δ^{01}_{12} = 0.4,
  Δ^{10}_{13} = 0.5,  Δ^{01}_{13} = 0.2,
  Δ^{10}_{23} = 0.3,  Δ^{01}_{23} = 0.1
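For the small example this claim can be checked by brute force: sorting the three labels by their marginals Δ^1_i attains the minimum conditional rank risk over all 3! orderings (a sanity check for the w ≡ 1 case, not code from the talk).

```python
from itertools import permutations

# the same example distribution P(y | x) as on the slide, w(y) ≡ 1
P = {
    (0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
    (1, 0, 0): 0.4, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.0,
}

def marginal(i, dist):
    """Delta^1_i for w ≡ 1, i.e. the marginal P(y_i = 1 | x)."""
    return sum(p for y, p in dist.items() if y[i] == 1)

def cond_rank_risk(order, dist):
    """Expected rank loss (w ≡ 1) when labels are ranked in the given
    order, order[0] ranked highest; a strict order has no ties."""
    risk = 0.0
    for y, p in dist.items():
        # a pair errs when an irrelevant label sits above a relevant one
        errs = sum(1 for a in range(len(order)) for b in range(len(order))
                   if a < b and y[order[a]] == 0 and y[order[b]] == 1)
        risk += p * errs
    return risk

best = min(permutations(range(3)), key=lambda o: cond_rank_risk(o, P))
by_marginals = tuple(sorted(range(3), key=lambda i: -marginal(i, P)))
```

Both orderings achieve the Bayes risk 0.4 + 0.2 + 0.1 = 0.7 from the min{Δ^{10}, Δ^{01}} values on the slide.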
Reduction to weighted binary relevance
Consider the univariate (weighted) exponential and logistic losses:

  l_exp(y, h) = w(y) Σ_{i=1}^{m} e^{−(2y_i − 1) h_i},
  l_log(y, h) = w(y) Σ_{i=1}^{m} log(1 + e^{−(2y_i − 1) h_i}).

The risk minimizer of these losses is:

  h*_i(x) = (1/c) log(Δ^1_i / Δ^0_i) = (1/c) log(Δ^1_i / (W − Δ^1_i)),

which is a strictly increasing transformation of Δ^1_i, where W = E[w(Y) | x] = Σ_y w(y) P(y | x).
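A quick numeric sanity check of the minimizer (not from the slides): for a single Bernoulli label with success probability p, minimizing the expected exponential loss recovers h* = ½ log(p / (1 − p)), the c = 2 case of the formula above; the value p = 0.6 is an arbitrary illustration.

```python
import math

def exp_risk(h, p):
    """Expected univariate exponential loss E[e^{-(2Y-1)h}] for P(Y=1) = p."""
    return p * math.exp(-h) + (1 - p) * math.exp(h)

def argmin_ternary(f, lo=-10.0, hi=10.0, iters=200):
    """Minimize a unimodal function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

p = 0.6  # stands in for Delta^1_i / W
h_star = argmin_ternary(lambda h: exp_risk(h, p))
closed_form = 0.5 * math.log(p / (1 - p))
```

The numeric minimizer and the closed form agree to high precision.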
Reduction to weighted binary relevance
Vertical reduction: solving m independent classification problems. Many algorithms that minimize a (weighted) exponential or logistic surrogate, such as AdaBoost or logistic regression, can be applied. Besides its simplicity and efficiency, this approach is consistent.
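The vertical reduction can be sketched end to end: fit one scorer per label and rank labels by their scores on the query. This is a minimal unweighted illustration with a hand-rolled logistic regression; the helper names and the tiny separable data set are made up for the example.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=300):
    """Plain logistic regression fit by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wk * xk for wk, xk in zip(w, xi))))
            w = [wk + lr * (yi - p) * xk for wk, xk in zip(w, xi)]
    return w

def wbr_rank(X, Y, x):
    """Binary relevance for ranking: one scorer per label,
    then sort the labels by their scores on the query x."""
    m = len(Y[0])
    scores = []
    for j in range(m):
        w = train_logistic(X, [row[j] for row in Y])
        scores.append(sum(wk * xk for wk, xk in zip(w, x)))
    return sorted(range(m), key=lambda j: -scores[j])
```

On data where label 0 fires for positive feature values and label 1 for negative ones, the predicted label order flips with the sign of the query feature.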
Regret bound15
Theorem: Let Reg_rnk(h) be the regret for the rank loss, and Reg_exp(h) and Reg_log(h) the regrets for the exponential and logistic losses, respectively. Then

  √6 · Reg_rnk(h) ≤ 4C · √(Reg_exp(h)),
  √2 · Reg_rnk(h) ≤ 2C · √(Reg_log(h)),

where C ≤ m√m · w_max.

15 K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In International Conference on Machine Learning, 2012
Main result: Sketch of proof
The main idea is to exploit similar regret bounds obtained for bipartite ranking.16
- Reduce the multilabel ranking problem horizontally to bipartite ranking, for each x separately.
- Since the labels are independent in bipartite ranking, transform the original label distribution into a new auxiliary one with independent labels.
- Adapt the known bounds to the reduced problem with the auxiliary distribution.
- Finally, return to the original problem.

16 W. Kotłowski, K. Dembczyński, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In International Conference on Machine Learning, pages 1113–1120, 2011
Main result: Horizontal reduction
For a given x, we define a bipartite ranking problem by setting X̃ = {1, ..., m}. The objects (instances) to be ranked correspond to the label indices of the MLR problem and are of the form x̃ = i (i = 1, ..., m); the corresponding label for x̃ = i is y_i. For the first training example of the MLR table (y = (1, 1, ..., 0)) this gives:

  x̃        ỹ
  x̃_1 = 1   ỹ_1 = 1
  x̃_2 = 2   ỹ_2 = 1
  ...
  x̃_m = m   ỹ_m = 0

Unfortunately, the labels y_i are not necessarily independent.
Main result: Transformation
The rank regret depends solely on the marginal weights Δ^1_i. Replace the original distribution P_x by a distribution P̃ for which the labels are conditionally independent,

  P̃(Ỹ = 1 | X̃ = i) = Δ^1_i / W,   P̃(X̃ = i) = 1/m,

and replace the original weights by w̃(y) ≡ W. The resulting problem has the same Δ^1_i.
Main result: Regret bound for an auxiliary problem
We adapt the known results for bipartite ranking:
Theorem: Let Reg_br(h̃, P̃) be the regret of the (unnormalized) bipartite ranking problem, and Reg_exp(h̃, P̃) and Reg_log(h̃, P̃) the corresponding exponential and logistic loss regrets. Then it holds:

  Reg_br(h̃, P̃) ≤ √(3/2) · √(Reg_exp(h̃, P̃)),
  Reg_br(h̃, P̃) ≤ √2 · √(Reg_log(h̃, P̃)).
Main result: Tracing back
We trace back from Reg_l(h̃, P̃) to Reg_rnk(h), where l stands for either the exponential or the logistic loss:

  P̃:   Reg_br(h̃, P̃)  vs.  Reg_l(h̃, P̃)    (bipartite ranking)
  P_x:  Reg_rnk(h | x) vs.  Reg_l(h | x)    (conditional MLR)
  E:    Reg_rnk(h)     vs.  Reg_l(h)        (MLR)
Inconsistency of the pairwise approach
The conditional risk of a pairwise surrogate loss is:

  L_φ(h, P | x) = Σ_{i>j} ( Δ^{10}_{ij} φ(h_i − h_j) + Δ^{10}_{ji} φ(h_j − h_i) ),

and a necessary condition for consistency is that the Bayes classifier h* for the φ-loss is also the Bayes ranker, i.e.,

  sgn(h*_i − h*_j) = sgn(Δ^{10}_{ij} − Δ^{01}_{ij}).

The (nonlinear monotone) transformation φ applies to the differences h_i − h_j, so the minimization of pairwise convex losses results in a complicated solution h*, where h*_i generally depends on all Δ^{10}_{jk} (1 ≤ j, k ≤ m), and not only on Δ^1_i. The only case in which the above convex pairwise loss is consistent is when the labels are independent (the case of bipartite ranking).
Experimental results: Synthetic data
[Figure: rank loss of WBR LR vs. LLLR as a function of the number of learning examples (250–16000), with the Bayes risk as reference. Left: independent data. Right: dependent data.]
- Label independence: the methods perform more or less on par.
- Label dependence: WBR shows small but consistent improvements.
Experimental results: Benchmark data
Table: Rank loss of WBR-AdaBoost vs. AdaBoost.MR (left) and WBR-LR vs. LLLR (right).

  dataset    AB.MR    WBR-AB   LLLR     WBR-LR
  image      0.2081   0.2041   0.2047   0.2065
  emotions   0.1703   0.1699   0.1743   0.1657
  scene      0.0720   0.0792   0.0861   0.0793
  yeast      0.2072   0.1820   0.1728   0.1736
  mediamill  0.0665   0.0609   0.0614   0.0472

WBR is at least competitive with state-of-the-art algorithms defined on pairwise surrogates.
Outline
1 Ranking problem
2 Multilabel ranking
3 Summary
Summary
- Ranking problem: different settings.
- Multi-label ranking.
- Consistency of multi-label rankers.
Conclusions
Take-away message:
- Multi-label ranking can be solved by a variant of BR.
- Pairwise approaches are inconsistent.
- Multi-label ranking is the simplest variant of conditional ranking problems.
For more, check: http://www.cs.put.poznan.pl/kdembczynski
Thank you for your attention!