A Simple Algorithm for Multilabel Ranking


A Simple Algorithm for Multilabel Ranking
Krzysztof Dembczyński [1], Wojciech Kotłowski [1], Eyke Hüllermeier [2]
[1] Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland
[2] Knowledge Engineering & Bioinformatics Lab (KEBI), Marburg University, Germany
EURO 2012, Vilnius, Lithuania

Multiclass Classification
Exactly one label is relevant for a document:
politics 0, economy 0, business 0, sport 0, tennis 1, soccer 0, show-business 0, celebrities 0, ..., England 0, USA 0, Poland 0, Lithuania 0

Multilabel Classification
Several labels can be relevant at the same time:
politics 0, economy 0, business 0, sport 1, tennis 1, soccer 0, show-business 0, celebrities 1, ..., England 1, USA 1, Poland 1, Lithuania 0

Multilabel Ranking
The labels are ordered from the most to the least relevant:
tennis ≻ sport ≻ England ≻ Poland ≻ USA ≻ ... ≻ politics

Multilabel Ranking

The goal is to learn a function $h(x) = (h_1(x), \ldots, h_m(x))$ that, for a given vector $x \in \mathcal{X}$, ranks the binary labels $y = (y_1, \ldots, y_m)$ from the most to the least relevant.

        X_1    X_2    Y_1   Y_2   ...   Y_m
  x_1   5.0    4.5     1     1    ...    0
  x_2   2.0    2.5     0     1    ...    0
  ...
  x_n   3.0    3.5     0     1    ...    1
  x     4.0    2.5     ?     ?    ...    ?

For the new instance $x$, the learned scores induce a ranking, e.g. $h_2 > h_1 > \ldots > h_m$, which corresponds to the label ordering $y_2 \succ y_1 \succ \ldots \succ y_m$.
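To make the prediction target concrete, here is a minimal sketch (my own illustration, not part of the slides; the label names and scores are made up) of how a score vector $h(x)$ is turned into a label ranking:

```python
import numpy as np

labels = ["politics", "sport", "tennis"]   # hypothetical label names
h_x = np.array([-1.2, 0.7, 1.5])           # hypothetical scores h_1(x), ..., h_m(x)

order = np.argsort(-h_x)                   # label indices, most relevant first
print([labels[i] for i in order])          # ['tennis', 'sport', 'politics']
```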

Multilabel Ranking

We use the rank loss to measure the quality of a ranking:

$$\ell_{\mathrm{rnk}}(y, h) = w(y) \sum_{(i,j)\,:\, y_i > y_j} \Big( [\![ h_i(x) < h_j(x) ]\!] + \tfrac{1}{2} [\![ h_i(x) = h_j(x) ]\!] \Big),$$

where $w(y) < w_{\max}$ is a weight function. For example, for $x$ with true labels $y = (1, 0, 0)$ and predicted ranking $h_2 > h_1 > \ldots > h_m$, the pair $(y_1, y_2)$ is misranked.

The weight function $w(y)$ is usually used to normalize the range of the rank loss to $[0, 1]$:

$$w(y) = \big(s_y (m - s_y)\big)^{-1}, \qquad s_y = \sum_i y_i,$$

i.e., it is equal to the inverse of the total number of pairwise comparisons between labels.
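For concreteness, a minimal sketch (my own, not code from the talk) of the weighted rank loss for a single instance, using the normalizing weight $w(y) = (s_y(m - s_y))^{-1}$ defined above:

```python
import numpy as np

def rank_loss(y, h):
    """Weighted rank loss for one instance: w(y) times the sum over pairs (i, j)
    with y_i > y_j of [h_i < h_j] + 0.5 * [h_i == h_j]."""
    y, h = np.asarray(y), np.asarray(h)
    s, m = y.sum(), len(y)
    if s == 0 or s == m:                 # no relevant/irrelevant pairs to compare
        return 0.0
    w = 1.0 / (s * (m - s))              # normalizing weight from the slide
    loss = 0.0
    for i in range(m):
        for j in range(m):
            if y[i] > y[j]:              # relevant label i vs. irrelevant label j
                loss += (h[i] < h[j]) + 0.5 * (h[i] == h[j])
    return w * loss

print(rank_loss([1, 0, 0], [0.2, 0.9, -0.3]))   # one of the two pairs misranked -> 0.5
```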

Risk and Bayes Classifier

We would like to minimize the expected rank loss, or risk, of $h(x)$:

$$L_{\mathrm{rnk}}(h, P) = \mathbb{E}\big[\ell_{\mathrm{rnk}}(Y, h(X))\big] = \int \ell_{\mathrm{rnk}}(y, h(x))\, dP(x, y).$$

The optimal solution can be determined pointwise, for each $x \in \mathcal{X}$ separately:

$$L_{\mathrm{rnk}}(h, P \mid x) = \mathbb{E}\big[\ell_{\mathrm{rnk}}(Y, h(x)) \mid x\big] = \sum_{y \in \mathcal{Y}} \ell_{\mathrm{rnk}}(y, h(x))\, P(y \mid x).$$

The optimal classifier $h^*$, referred to as the Bayes classifier, is then given by:

$$h^*(x) = \operatorname*{arg\,min}_{s \in \mathbb{R}^m} \sum_{y \in \mathcal{Y}} \ell_{\mathrm{rnk}}(y, s)\, P(y \mid x).$$

It is often more reasonable to compare a given $h$ to the Bayes classifier $h^*$ by means of the regret, defined by:

$$\mathrm{Reg}_{\ell}(h, P) = L_{\ell}(h, P) - L_{\ell}(h^*, P).$$

Regret and Consistency

Since the rank loss is neither convex nor differentiable, we want to use a surrogate loss that is easier to optimize and leads to similar results. We say that a surrogate loss $\ell$ is consistent with the rank loss when the following holds:

$$\mathrm{Reg}_{\ell}(h, P) \to 0 \;\Longrightarrow\; \mathrm{Reg}_{\mathrm{rnk}}(h, P) \to 0.$$

Pairwise Surrogate Losses

The most intuitive approach is to use pairwise convex surrogate losses of the form

$$\ell_{\phi}(y, h) = w(y) \sum_{(i,j)\,:\, y_i > y_j} \phi(h_i - h_j),$$

where $\phi$ is
- the exponential function (BoosTexter) [1]: $\phi(f) = e^{-f}$,
- the logistic function (LLLR) [2]: $\phi(f) = \log(1 + e^{-f})$, or
- the hinge function (RankSVM) [3]: $\phi(f) = \max(0, 1 - f)$.

[1] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[2] O. Dekel, Ch. Manning, and Y. Singer. Log-linear models for label ranking. In NIPS, 2003.
[3] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, pages 681–687, 2001.
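The pairwise surrogate is easy to write down directly; the following sketch (mine, with made-up inputs) evaluates $\ell_{\phi}$ for the three choices of $\phi$ listed above:

```python
import numpy as np

# The three margin functions phi from the slide.
phi_exp = lambda f: np.exp(-f)
phi_log = lambda f: np.log1p(np.exp(-f))
phi_hinge = lambda f: np.maximum(0.0, 1.0 - f)

def pairwise_surrogate(y, h, w, phi):
    """l_phi(y, h) = w(y) * sum over pairs (i, j) with y_i > y_j of phi(h_i - h_j)."""
    y, h = np.asarray(y), np.asarray(h)
    m = len(y)
    return w * sum(phi(h[i] - h[j])
                   for i in range(m) for j in range(m) if y[i] > y[j])

print(pairwise_surrogate([1, 0, 0], [0.2, 0.9, -0.3], w=0.5, phi=phi_hinge))
```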

Surrogate Losses

[Figure: plot of $\phi(f)$ as a function of the margin $f$ for the Boolean test (0/1), exponential, logistic, and hinge losses.]

Pairwise Surrogate Losses

Let us denote

$$\Delta^{uv}_{ij} = \sum_{y\,:\, y_i = u,\, y_j = v} w(y)\, P(y \mid x).$$

- $\Delta^{uv}_{ij}$ reduces to $P(Y_i = u, Y_j = v \mid x)$ for $w(y) \equiv 1$.
- $\Delta^{uv}_{ij} = \Delta^{vu}_{ji}$ for all $(i, j)$.
- Let $W = \mathbb{E}[w(Y) \mid x] = \sum_y w(y) P(y \mid x)$. Then $\Delta^{00}_{ij} + \Delta^{01}_{ij} + \Delta^{10}_{ij} + \Delta^{11}_{ij} = W$.

Example distribution (with $w(y) \equiv 1$):

  P(y)   w   Y_1  Y_2  Y_3
  0.0    1    0    0    0
  0.0    1    0    0    1
  0.2    1    0    1    0
  0.2    1    0    1    1
  0.4    1    1    0    0
  0.1    1    1    0    1
  0.1    1    1    1    0
  0.0    1    1    1    1

For this distribution: $\Delta^{10}_{12} = 0.5$, $\Delta^{01}_{12} = 0.4$, $\Delta^{10}_{13} = 0.5$, $\Delta^{01}_{13} = 0.2$, $\Delta^{10}_{23} = 0.3$, $\Delta^{01}_{23} = 0.1$.

Pairwise Surrogate Losses

The conditional risk can be written as

$$L_{\mathrm{rnk}}(h, P \mid x) = \sum_{i > j} \Big( \Delta^{10}_{ij} [\![ h_i < h_j ]\!] + \Delta^{01}_{ij} [\![ h_i > h_j ]\!] + \tfrac{1}{2}\big(\Delta^{10}_{ij} + \Delta^{01}_{ij}\big) [\![ h_i = h_j ]\!] \Big).$$

Its minimum (the Bayes risk) is

$$L^*_{\mathrm{rnk}}(P \mid x) = \sum_{i > j} \min\{\Delta^{10}_{ij}, \Delta^{01}_{ij}\}.$$

The conditional risk of a pairwise surrogate loss is

$$L_{\phi}(h, P \mid x) = \sum_{i > j} \Delta^{10}_{ij}\, \phi(h_i - h_j) + \Delta^{10}_{ji}\, \phi(h_j - h_i),$$

and a necessary condition for consistency is that the Bayes classifier $h^*$ for the $\phi$-loss is also the Bayes ranker, i.e.,

$$\operatorname{sign}(h^*_i - h^*_j) = \operatorname{sign}(\Delta^{10}_{ij} - \Delta^{01}_{ij}).$$
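As a quick numerical check (a sketch of mine, not from the talk), the $\Delta$ values and the Bayes rank risk for the example distribution above can be reproduced directly from their definitions:

```python
from itertools import combinations

# Example distribution P(y | x) from the slide, with w(y) = 1 for all y.
P = {(0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
     (1, 0, 0): 0.4, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.0}
m = 3

def delta(u, v, i, j):
    """Delta^{uv}_{ij} = sum of w(y) P(y | x) over y with y_i = u and y_j = v."""
    return sum(p for y, p in P.items() if y[i] == u and y[j] == v)

# Delta^{10}_{ij}, Delta^{01}_{ij} for all pairs, and the Bayes rank risk
# L*_rnk(P | x) = sum over pairs of min{Delta^{10}_{ij}, Delta^{01}_{ij}}.
bayes_risk = 0.0
for i, j in combinations(range(m), 2):
    d10, d01 = delta(1, 0, i, j), delta(0, 1, i, j)
    print(f"pair ({i+1},{j+1}):  Delta10 = {d10:.1f}  Delta01 = {d01:.1f}")
    bayes_risk += min(d10, d01)
print("Bayes rank risk:", bayes_risk)   # 0.4 + 0.2 + 0.1 = 0.7
```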

Multilabel Ranking

This approach is, however, inconsistent for the most commonly used convex surrogates (Duchi et al. 2010 [4], Gao and Zhou 2011 [5]).

- The (nonlinear, monotone) transformation $\phi$ is applied to the differences $h_i - h_j$, so minimizing the pairwise convex losses results in a complicated solution $h^*$, where $h^*_i$ generally depends on all $\Delta^{10}_{jk}$ ($1 \le j, k \le m$).
- The only case in which the above convex pairwise loss is consistent is when the labels are independent (the case of bipartite ranking).
- There exists a class of pairwise surrogates that is consistent, but we will present a different, simpler approach that is also consistent.

[4] J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010.
[5] W. Gao and Z. Zhou. On the consistency of multi-label learning. In COLT, pages 341–358, 2011.

Reduction to Weighted Binary Relevance

The Bayes ranker can be obtained by sorting the labels according to

$$\Delta^1_i = \sum_{y\,:\, y_i = 1} w(y)\, P(y \mid x).$$

For $w(y) \equiv 1$, the labels should be sorted according to their marginal probabilities, since $\Delta^u_i$ reduces to $P(Y_i = u \mid x)$ in this case (Dembczyński et al. 2010) [6].

[6] K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279–286, 2010.

Reduction to Weighted Binary Relevance

Recall that the minimum (the Bayes risk) is

$$L^*_{\mathrm{rnk}}(P \mid x) = \sum_{i > j} \min\{\Delta^{10}_{ij}, \Delta^{01}_{ij}\}.$$

Since $\Delta^1_i = \Delta^{10}_{ij} + \Delta^{11}_{ij}$, we can prove that

$$\Delta^1_i - \Delta^1_j = \Delta^{10}_{ij} - \Delta^{01}_{ij}.$$

For the example distribution above: $\Delta^1_1 = 0.6$, $\Delta^1_2 = 0.5$, $\Delta^1_3 = 0.3$, while $\Delta^{10}_{12} = 0.5$, $\Delta^{01}_{12} = 0.4$, $\Delta^{10}_{13} = 0.5$, $\Delta^{01}_{13} = 0.2$, $\Delta^{10}_{23} = 0.3$, $\Delta^{01}_{23} = 0.1$; e.g., $\Delta^1_1 - \Delta^1_2 = 0.1 = \Delta^{10}_{12} - \Delta^{01}_{12}$.

Reduction to Weighted Binary Relevance

Consider the univariate (weighted) exponential and logistic losses:

$$\ell_{\exp}(y, h) = w(y) \sum_{i=1}^{m} e^{-(2 y_i - 1) h_i},$$
$$\ell_{\log}(y, h) = w(y) \sum_{i=1}^{m} \log\big(1 + e^{-(2 y_i - 1) h_i}\big).$$

The risk minimizer of these losses is

$$h^*_i(x) = \frac{1}{c} \log \frac{\Delta^1_i}{\Delta^0_i} = \frac{1}{c} \log \frac{\Delta^1_i}{W - \Delta^1_i},$$

which is a strictly increasing transformation of $\Delta^1_i$, where $W = \mathbb{E}[w(Y) \mid x] = \sum_y w(y) P(y \mid x)$.
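Continuing the running example, a minimal sketch (my own; the slide does not fix the constant $c$, here I use $c = 1$, which is the standard constant for the logistic loss, with $c = 2$ for the exponential loss) computing $\Delta^1_i$, the closed-form minimizer $h^*_i$, and the induced ranking:

```python
import numpy as np

# Same example distribution as before, with w(y) = 1.
P = {(0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.2, (0, 1, 1): 0.2,
     (1, 0, 0): 0.4, (1, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.0}
m = 3
W = sum(P.values())                                   # E[w(Y) | x] = 1 here

delta1 = [sum(p for y, p in P.items() if y[i] == 1) for i in range(m)]
print("Delta^1_i:", delta1)                           # [0.6, 0.5, 0.3]

# Pointwise risk minimizer of the univariate (weighted) logistic loss:
# h*_i = log(Delta^1_i / (W - Delta^1_i))   (c = 1; the exponential loss gives c = 2).
h_star = [np.log(d / (W - d)) for d in delta1]
print("h*_i:", np.round(h_star, 3))

# Sorting by h*_i is the same as sorting by Delta^1_i, i.e. the Bayes ranker.
print("labels, most relevant first:", list(np.argsort(-np.array(h_star)) + 1))
```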

Main Result [7]

Theorem. Let $\mathrm{Reg}_{\mathrm{rnk}}(h, P)$ be the regret for the rank loss, and $\mathrm{Reg}_{\exp}(h, P)$ and $\mathrm{Reg}_{\log}(h, P)$ be the regrets for the exponential and logistic losses, respectively. Then

$$\sqrt{6}\, \mathrm{Reg}_{\mathrm{rnk}}(h, P) \le 4 \sqrt{C\, \mathrm{Reg}_{\exp}(h, P)},$$
$$\sqrt{2}\, \mathrm{Reg}_{\mathrm{rnk}}(h, P) \le 2 \sqrt{C\, \mathrm{Reg}_{\log}(h, P)},$$

where $C \le m\sqrt{m}\, w_{\max}$.

[7] K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012.

Reduction to Weighted Binary Relevance

- Vertical reduction: solve m independent binary classification problems.
- Many algorithms that minimize the (weighted) exponential or logistic surrogate, such as AdaBoost or logistic regression, can be applied.
- Besides its simplicity and efficiency, this approach is consistent.
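Concretely, the reduction can be implemented with any off-the-shelf weighted binary learner. Below is a minimal sketch (my own assumption of how one might do it, not the authors' code) using scikit-learn's LogisticRegression: one model per label, trained with instance weights $w(y)$, and labels ranked by the resulting scores. The toy data and the zero-weight handling of degenerate (all-0 or all-1) label vectors are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wbr_fit(X, Y):
    """Weighted binary relevance: one (weighted) logistic model per label."""
    n, m = Y.shape
    s = Y.sum(axis=1)
    # Instance weights w(y) = 1 / (s_y (m - s_y)); instances with all-0 or all-1
    # label vectors contribute no label pairs, so they get zero weight here.
    w = np.where((s > 0) & (s < m), 1.0 / np.maximum(s * (m - s), 1), 0.0)
    models = []
    for i in range(m):
        clf = LogisticRegression()
        clf.fit(X, Y[:, i], sample_weight=w)
        models.append(clf)
    return models

def wbr_rank(models, x):
    """Rank labels for a single instance x by decreasing score h_i(x)."""
    scores = np.array([clf.decision_function(x.reshape(1, -1))[0] for clf in models])
    return np.argsort(-scores), scores

# Toy usage with random data (placeholders, not the experiments from the talk).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = (rng.random((200, 3)) < 0.4).astype(int)
models = wbr_fit(X, Y)
print(wbr_rank(models, X[0]))
```

Each label's model is trained independently, so the whole procedure scales linearly in the number of labels.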

Empirical Results

We use the rank loss with weights defined as

$$w(y) = \big(s_y (m - s_y)\big)^{-1}, \qquad s_y = \sum_i y_i,$$

i.e., the inverse of the total number of pairwise comparisons between labels. We compare algorithms in terms of their surrogate losses:

- exponential loss: AdaBoost.MR vs. WBR-AdaBoost,
- logistic loss: LLLR vs. WBR Logistic Regression,

on synthetic and benchmark data sets.

Logistic Loss

[Figure: rank loss of WBR LR and LLLR versus the number of learning examples (250 to 16000), compared to the Bayes risk. Left: independent data. Right: dependent data.]

- In the case of label independence, the methods perform more or less on par.
- In the case where labels are dependent, the univariate approach shows small but consistent improvements.

Exponential Loss

[Figure: rank loss of WBR AdaBoost and AdaBoost.MR versus the number of learning examples (250 to 16000), compared to the Bayes risk. Left: independent data. Right: dependent data.]

- AdaBoost.MR behaves strangely: with more than 20 stumps it quickly overfits.
- In both cases, the univariate approach outperforms the pairwise approach.

Benchmark Data

Table: Exponential loss-based (left) and logistic loss-based (right) algorithms. For each data set, the winner of each pair of competing algorithms is marked with *.

  dataset      AB.MR     WBR-AB     LLLR      WBR-LR
  image        0.2081    *0.2041    *0.2047   0.2065
  emotions     0.1703    *0.1699    0.1743    *0.1657
  scene        *0.0720   0.0792     0.0861    *0.0793
  yeast        0.2072    *0.1820    *0.1728   0.1736
  mediamill    0.0665    *0.0609    0.0614    *0.0472

The simple reduction algorithms, trained independently on each label, are at least competitive with state-of-the-art algorithms defined on pairwise surrogates.

Conclusions

- We have shown that common univariate convex surrogates are consistent for multilabel ranking.
- We proved explicit regret bounds relating the ranking regret to the univariate loss regret.
- The results are arguably surprising in light of previous ones, in which inconsistency was shown for the most popular pairwise surrogates.
- On the more practical side, our results motivate simple and scalable algorithms for multilabel ranking, which are plain modifications of standard classification algorithms.

This project is partially supported by the Foundation for Polish Science under the Homing Plus programme, co-financed by the European Regional Development Fund.