Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss


Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss. Krzysztof Dembczyński 1, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1. 1 Knowledge Engineering and Bioinformatics Lab. 2 Research Unit Knowledge-based Systems (KERMIT), Universiteit Gent. ECML PKDD 2010, Barcelona, September 23, 2010

Multi-Label Classification Given a vector x ∈ X of features, the goal is to learn a function h(x) that accurately predicts a binary vector y = (y_1, ..., y_m) ∈ Y of labels. Example: given a news report, the goal is to learn a machine that tags the report with relevant categories. The simple solution: solve the problem for each label independently. 2 / 28

Multi-Label Classification Since the prediction is made for all labels simultaneously, two interesting issues arise: a multitude of loss functions defined over multiple labels, and dependence/correlation between labels. 3 / 28

Pitfalls in MLC Studies In recent years, plenty of algorithms have been introduced... 4 / 28

Pitfalls in MLC Studies A large number of loss functions is commonly applied as performance metrics, but a concrete connection between a multi-label classifier and a loss function is rarely established. This implicitly gives the misleading impression that the same method can be optimal for different loss functions. It is assumed that performance can be improved by taking label dependence into account, but this term is used in an intuitive manner, without any precise formal definition. 5 / 28

Pitfalls in MLC Studies Empirical results are reported on average, without investigating under which conditions a given algorithm benefits. The reasons for improvements are not distinguished. 6 / 28

Loss Functions Let us discuss multi-label loss functions... 7 / 28

Hamming and Subset 0/1 Loss Hamming loss measures the fraction of labels whose relevance is incorrectly predicted:

L_H(y, h(x)) = (1/m) ∑_{i=1}^m [[y_i ≠ h_i(x)]],

while subset 0/1 loss measures whether the prediction totally agrees with the true labeling:

L_s(y, h(x)) = [[y ≠ h(x)]],

where [[·]] denotes the indicator (Iverson) bracket. 8 / 28
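As a minimal Python sketch (my illustration, not part of the slides), the two losses for a single instance can be computed as:

```python
def hamming_loss(y, h):
    """Fraction of labels whose relevance is incorrectly predicted."""
    return sum(yi != hi for yi, hi in zip(y, h)) / len(y)

def subset_zero_one_loss(y, h):
    """1 if the predicted labeling differs from the true one anywhere, else 0."""
    return int(tuple(y) != tuple(h))

# One of three labels is wrong: Hamming loss 1/3, subset 0/1 loss 1.
print(hamming_loss([1, 0, 1], [1, 1, 1]))          # 0.333...
print(subset_zero_one_loss([1, 0, 1], [1, 1, 1]))  # 1
```

Note how a single wrong label costs only 1/m under Hamming loss but the full unit cost under subset 0/1 loss.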

Analysis of Hamming and Subset 0/1 Loss Analysis contains: The form of risk minimizers, Whether the risk minimizers coincide in some circumstances, Bound analysis, Regret analysis. The analysis is simplified by assuming an unconstrained hypothesis space. 9 / 28

Risk Minimizers The risk minimizer is defined by:

h*(x) = arg min_h E_{Y|x} L(Y, h).

For the Hamming loss it is given by:

h_i*(x) = arg max_{b ∈ {0,1}} P(y_i = b | x), i = 1, ..., m,

while for the subset 0/1 loss by:

h*(x) = arg max_{y ∈ Y} P(y | x).

The minimizer of the Hamming loss is the marginal mode, while that of the subset 0/1 loss is the joint mode. 10 / 28
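The marginal mode and the joint mode can differ, as a small illustration (my own example distribution, not from the slides) shows:

```python
# Hypothetical conditional distribution P(y | x) over m = 2 labels,
# chosen so that the marginal modes and the joint mode differ.
P = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}

def joint_mode(P):
    """Subset 0/1 risk minimizer: the most probable label vector."""
    return max(P, key=P.get)

def marginal_modes(P):
    """Hamming risk minimizer: the mode of each marginal P(y_i | x)."""
    m = len(next(iter(P)))
    return tuple(
        int(sum(p for y, p in P.items() if y[i] == 1) > 0.5)
        for i in range(m)
    )

print(joint_mode(P))      # (0, 0)
print(marginal_modes(P))  # (1, 0): P(y_1=1 | x) = 0.6, P(y_2=1 | x) = 0.3
```

Here the label vector (1, 0), which is optimal for Hamming loss, has joint probability 0.3 and is not the most probable labeling.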

Coincidence of Risk Minimizers Proposition The Hamming loss and subset 0/1 loss have the same risk minimizer, h_H*(x) = h_s*(x), if one of the following conditions holds: (1) Labels Y_1, ..., Y_m are conditionally independent,

P(Y | x) = ∏_{i=1}^m P(Y_i | x).

(2) The probability of the joint mode satisfies

P(h_s*(x) | x) ≥ 0.5.

11 / 28

Bound Analysis Proposition For all distributions of Y given x, and for all models h, the expectation of the subset 0/1 loss can be bounded in terms of the expectation of the Hamming loss as follows:

(1/m) E_Y[L_s(Y, h(x))] ≤ E_Y[L_H(Y, h(x))] ≤ E_Y[L_s(Y, h(x))].

12 / 28
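The sandwich bound can be checked numerically on a toy distribution (my own sanity check, using a hypothetical P over two labels):

```python
from itertools import product

# Hypothetical conditional distribution P(y | x) over m = 2 labels.
P = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}
m = 2

def expected_losses(P, h):
    """Expected Hamming loss and expected subset 0/1 loss of prediction h."""
    e_H = sum(p * sum(yi != hi for yi, hi in zip(y, h)) / len(h)
              for y, p in P.items())
    e_s = sum(p for y, p in P.items() if y != h)
    return e_H, e_s

# Verify (1/m) E[L_s] <= E[L_H] <= E[L_s] for every possible prediction.
for h in product((0, 1), repeat=m):
    e_H, e_s = expected_losses(P, h)
    assert e_s / m <= e_H + 1e-12 and e_H <= e_s + 1e-12
```

For example, for h = (0, 0) the expected losses are E[L_H] = 0.45 and E[L_s] = 0.6, and indeed 0.3 ≤ 0.45 ≤ 0.6.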

Regret Analysis The previous results may suggest that one loss function can be used as a proxy for the other: in some situations both risk minimizers coincide, and one can provide mutual bounds for both loss functions. However, a worst-case analysis shows that the regret can be high... 13 / 28

Regret Analysis The regret of a classifier h with respect to a loss function L_z is defined as:

r_{L_z}(h) = E_{XY}[L_z(Y, h(X))] − E_{XY}[L_z(Y, h_z*(X))],

where the expectation is taken over the joint distribution P(X, Y), and h_z* is the Bayes-optimal classifier with respect to the loss function L_z. Since both loss functions are decomposable with respect to individual instances, we analyze the expectation over Y for a given x. 14 / 28

Regret Analysis Proposition (Regret for subset 0/1 loss) The following upper bound holds:

E_Y[L_s(Y, h_H*(x))] − E_Y[L_s(Y, h_s*(x))] < 0.5.

Moreover, this bound is tight, i.e.,

sup_P (E_Y[L_s(Y, h_H*(x))] − E_Y[L_s(Y, h_s*(x))]) = 0.5,

where the supremum is taken over all probability distributions on Y. 15 / 28

Regret Analysis Proposition (Regret for Hamming loss) The following upper bound holds for m > 3:

E_Y[L_H(Y, h_s*(x))] − E_Y[L_H(Y, h_H*(x))] < (m − 2)/(m + 2).

Moreover, this bound is tight, i.e.,

sup_P (E_Y[L_H(Y, h_s*(x))] − E_Y[L_H(Y, h_H*(x))]) = (m − 2)/(m + 2),

where the supremum is taken over all probability distributions on Y. 16 / 28

Loss Functions Summary: The risk minimizers of Hamming and subset 0/1 loss have a different structure: marginal mode vs. joint mode. Under specific conditions, these two risk minimizers are provably equivalent. The two loss functions mutually bound each other. Minimizing the subset 0/1 loss may cause a high regret for the Hamming loss, and vice versa. 17 / 28

Empirical Confirmation of Theoretical Results Let us empirically confirm theoretical results... 18 / 28

Binary Relevance (BR) The simplest approach, in which a separate binary classifier h_i(·) is trained for each label λ_i:

h_i : X → {0, 1}, x ↦ y_i.

It is often criticized for treating labels independently. However, it is still an unbiased approach for the Hamming loss (and for other losses for which the marginal distributions suffice to obtain the risk-minimizing model). 19 / 28

Label Power-set (LP) This method reduces the problem to multi-class classification by considering each label subset L ⊆ 𝓛 as a distinct meta-class:

h : X → {0, 1}^m, x ↦ y ∈ {0, 1}^m.

It is often claimed to be the right approach to MLC, since it takes label dependence into account. However, this approach is clearly tailored for the subset 0/1 loss. 20 / 28
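The LP reduction can be sketched as follows (an illustrative encoding, not the authors' implementation): each distinct label vector observed in the training data becomes one meta-class of an ordinary multi-class problem.

```python
def lp_encode(Y):
    """Map label vectors to meta-class ids; return ids and the decoding table."""
    classes = sorted(set(map(tuple, Y)))
    index = {c: k for k, c in enumerate(classes)}
    return [index[tuple(y)] for y in Y], classes

def lp_decode(k, classes):
    """Recover the label vector associated with meta-class k."""
    return classes[k]

Y = [[1, 0], [0, 1], [1, 0], [1, 1]]
ids, classes = lp_encode(Y)
print(ids)      # [1, 0, 1, 2]
print(classes)  # [(0, 1), (1, 0), (1, 1)]
```

Any multi-class learner can then be trained on the ids; its prediction is decoded back into a full label vector, which is why LP targets the most probable labeling as a whole.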

Experimental Settings The artificial data experiments: conditionally independent data, conditionally dependent data, non-linear data; low-dimensional problems with 2 or 3 labels; two classifiers, Binary Relevance (BR) and Label Power-set (LP), with linear SVM as base learner. 21 / 28

Table: Results on two artificial data sets: conditionally independent (top) and conditionally dependent (bottom).

Conditional independence
classifier   Hamming loss       subset 0/1 loss
BR           0.4208 (±.0014)    0.8088 (±.0020)
LP           0.4212 (±.0011)    0.8101 (±.0025)
B-O          0.4162             0.8016

Conditional dependence
classifier   Hamming loss       subset 0/1 loss
BR           0.3900 (±.0015)    0.7374 (±.0021)
LP           0.4227 (±.0019)    0.6102 (±.0033)
B-O          0.3897             0.6029

B-O is the Bayes-optimal classifier. 22 / 28

Figure: Data set composed of two labels: the first label is obtained by a linear model, while the second label represents the XOR problem. 23 / 28

Table: Results of three classifiers on this data set.

classifier       Hamming loss       subset 0/1 loss
BR Linear SVM    0.2399 (±.0097)    0.4751 (±.0196)
LP Linear SVM    0.0143 (±.0020)    0.0195 (±.0011)
BR MLRules       0.0011 (±.0002)    0.0020 (±.0003)
B-O              0                  0

24 / 28

Experimental Settings Summary: LP takes label dependence into account, namely the conditional one: it is well-tailored for the subset 0/1 loss, but fails for the Hamming loss. LP may gain from expansion of the feature or hypothesis space. One can easily tailor LP to the Hamming loss minimization problem by marginalizing the joint probability distribution that this classifier yields as a by-product. 25 / 28
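The marginalization mentioned in the last point can be sketched as follows (assuming, as an illustration, that the LP classifier outputs a probability for each meta-class):

```python
def lp_to_hamming_prediction(class_probs, classes):
    """Marginalize LP's distribution over meta-classes into per-label
    probabilities, then threshold at 0.5 (the Hamming risk minimizer)."""
    m = len(classes[0])
    marginals = [
        sum(p for c, p in zip(classes, class_probs) if c[i] == 1)
        for i in range(m)
    ]
    return tuple(int(p > 0.5) for p in marginals)

# Meta-classes and the probabilities an LP classifier might assign to them.
classes = [(0, 0), (1, 0), (1, 1)]
probs = [0.4, 0.3, 0.3]
print(lp_to_hamming_prediction(probs, classes))  # (1, 0)
```

Note that the most probable meta-class here is (0, 0), so predicting the joint mode and predicting the marginal modes lead to different label vectors, exactly as the regret analysis suggests.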

Figure: Results of three classifiers (BR Linear SVM, LP Linear SVM, BR MLRules) on 8 benchmark data sets (image, scene, emotions, reuters, yeast, genbase, slashdot, medical), plotted as subset 0/1 loss against Hamming loss. The experimental results on benchmark data confirm our claims. 26 / 28

Take-Away Message The message to be taken home... 27 / 28

Take-Away Message Surprisingly, new methods are often proposed without explicitly stating which loss they intend to minimize. A careful distinction between loss functions seems to be even more important for MLC than for standard classification. One cannot expect the same MLC method to be optimal for different types of losses. The reasons for improvements should be carefully distinguished. 28 / 28