Relationship between Loss Functions and Confirmation Measures

Krzysztof Dembczyński(1), Salvatore Greco(2), Wojciech Kotłowski(1) and Roman Słowiński(1,3)

(1) Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland, {kdembczynski, wkotlowski, rslowinski}@cs.put.poznan.pl
(2) Faculty of Economics, University of Catania, 95129 Catania, Italy, salgreco@unict.it
(3) Institute for Systems Research, Polish Academy of Sciences, 01-447 Warsaw, Poland

Abstract. In this paper, we present the relationship between loss functions and confirmation measures. We show that population minimizers for weighted loss functions correspond to confirmation measures. This result can be used in the construction of machine learning methods, in particular ensemble methods.

1 Introduction

Let us define the prediction problem in a similar way as in [4]. The aim is to predict the unknown value of an attribute y (sometimes called output, response variable or decision attribute) of an object using the known joint values of other attributes (sometimes called predictors, condition attributes or independent variables) x = (x_1, x_2, ..., x_n). We consider the binary classification problem, in which we assume y ∈ {−1, 1}. All objects for which y = 1 constitute decision class Cl_1, and all objects for which y = −1 constitute decision class Cl_{-1}. The goal of the learning task is to find a function F(x) (in general, F(x) ∈ R), using a set of training examples {y_i, x_i}_1^N, that predicts y accurately (in other words, accurately assigns objects to decision classes). The optimal prediction procedure is given by:

F*(x) = arg min_{F(x)} E_{yx} L(y, F(x)),  (1)

where the expected value E_{yx} is taken over the joint distribution of all variables (y, x) for the data to be predicted. L(y, F(x)) is the loss or cost for predicting F(x) when the actual value is y, and E_{yx} L(y, F(x)) is often called the prediction risk. Nevertheless, the learning procedure can use only the set of training examples {y_i, x_i}_1^N.
Using this set, it tries to construct F(x) to be the best possible approximation of F*(x). The typical loss function in binary classification tasks is the so-called 0-1 loss:

L_{0-1}(y, F(x)) = { 0 if yF(x) > 0,  1 if yF(x) ≤ 0. }  (2)
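To make (2) concrete, the 0-1 loss depends only on the sign of the product yF(x), and the empirical risk of a classifier is the average of this loss over the training examples. A minimal Python sketch (the pairs (y_i, F(x_i)) below are made up for illustration):

```python
def zero_one_loss(y, fx):
    """0-1 loss (2): no loss when y and F(x) agree in sign."""
    return 0.0 if y * fx > 0 else 1.0

# Empirical 0-1 risk of a scoring function F over a toy training set.
sample = [(1, 0.8), (-1, -0.3), (1, -0.1), (-1, 0.5)]   # pairs (y_i, F(x_i))
risk = sum(zero_one_loss(y, fx) for y, fx in sample) / len(sample)
assert risk == 0.5   # two of the four examples are misclassified
```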

It is possible to use loss functions other than (2). Each of these functions has some interesting properties. One of them is the population minimizer of the prediction risk. By conditioning (1) on x (i.e., factoring the joint distribution P(y, x) = P(x) P(y|x)), we obtain:

F*(x) = arg min_{F(x)} E_x E_{y|x} L(y, F(x)).  (3)

It is easy to see that it suffices to minimize (3) pointwise:

F*(x) = arg min_{F(x)} E_{y|x} L(y, F(x)).  (4)

The solution of the above is called the population minimizer. In other words, it answers the question: what does minimization of the expected loss estimate on the population level? Let us recall that the population minimizer for the 0-1 loss function is:

F*(x) = { 1 if P(y = 1|x) > 1/2,  −1 if P(y = 1|x) ≤ 1/2. }  (5)

From the above, it is easy to see that by minimizing the 0-1 loss function one estimates the region of the predictor space in which class Cl_1 is observed with higher probability than class Cl_{-1}. Minimization of some other loss functions can be seen as estimation of conditional probabilities (see Section 2). On the other hand, Bayesian confirmation measures (see, for example, [5,9]) have received special attention in knowledge discovery [7]. A confirmation measure c(H, E) says to what degree a piece of evidence E confirms (or disconfirms) a hypothesis H. It is required to satisfy:

c(H, E) { > 0 if P(H|E) > P(H),  = 0 if P(H|E) = P(H),  < 0 if P(H|E) < P(H), }  (6)

where P(H) is the probability of hypothesis H and P(H|E) is the conditional probability of hypothesis H given evidence E. In Section 3, two confirmation measures of particular interest are discussed. In this paper, we present the relationship between loss functions and confirmation measures. The motivation for this study is the question: what is the form of the loss function for estimating the region of the predictor space in which class Cl_1 is observed with positive confirmation, or alternatively, for estimating a confirmation measure for given x and y?
In the following, we show that population minimizers for weighted loss functions correspond to confirmation measures. Weighted loss functions are often used in the case of imbalanced class distribution, i.e., when the probabilities P(y = 1) and P(y = −1) differ substantially. This result is described in Section 4. The paper is concluded in the last section.
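Both (5) and (6) can be illustrated numerically: for a fixed x, the conditional 0-1 risk is minimized by predicting the more probable class, and the simple difference P(H|E) − P(H) satisfies the sign requirement of a confirmation measure. A small sketch with illustrative probabilities (not taken from the paper):

```python
def risk_01(p_pos, f):
    """Conditional 0-1 risk E_{y|x} L_{0-1}(y, F(x)), given P(y=1|x) = p_pos."""
    return p_pos * (0.0 if f > 0 else 1.0) + (1.0 - p_pos) * (1.0 if f > 0 else 0.0)

def c_diff(p_h_given_e, p_h):
    """The difference measure d(H, E) = P(H|E) - P(H), one valid confirmation measure."""
    return p_h_given_e - p_h

for p_pos in (0.3, 0.5, 0.7):
    # Population minimizer (5): predict 1 iff P(y=1|x) > 1/2.
    best = 1 if risk_01(p_pos, 1.0) < risk_01(p_pos, -1.0) else -1
    assert best == (1 if p_pos > 0.5 else -1)

# Sign requirement (6): evidence raising the probability of H confirms H.
assert c_diff(0.8, 0.5) > 0 and c_diff(0.5, 0.5) == 0 and c_diff(0.2, 0.5) < 0
```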

[Fig. 1. The most popular loss functions plotted against the margin yF(x): 0-1 loss, exponential loss, binomial log-likelihood loss and squared-error loss (figure prepared in R [10]; a similar figure may be found in [8], also prepared in R).]

2 Loss Functions

Different loss functions are used in prediction problems (for a wide discussion, see [8]). In this paper we consider, besides the 0-1 loss, the following three loss functions for binary classification:

exponential loss:
L_exp(y, F(x)) = exp(−yF(x)),  (7)

binomial negative log-likelihood loss:
L_log(y, F(x)) = log(1 + exp(−2yF(x))),  (8)

squared-error loss:
L_sqr(y, F(x)) = (y − F(x))^2 = (1 − yF(x))^2.  (9)

These loss functions are presented in Figure 1. The exponential loss is used in AdaBoost [6]. The binomial negative log-likelihood loss is common in statistical approaches; it is also used in Gradient Boosting Machines [3]. The reformulation of the squared-error loss in (9) is possible because y ∈ {−1, 1}. The squared-error loss is not a monotone decreasing function of the margin yF(x): for values yF(x) > 1 it increases quadratically. For this reason, one has to use this loss function very carefully in classification tasks.
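The shapes compared in Figure 1 follow directly from definitions (2) and (7)–(9); in particular, the non-monotonicity of the squared-error loss beyond yF(x) = 1 is easy to check. A minimal sketch in Python:

```python
import math

def loss_01(m):  return 0.0 if m > 0 else 1.0            # 0-1 loss (2), m = yF(x)
def loss_exp(m): return math.exp(-m)                     # exponential loss (7)
def loss_log(m): return math.log(1 + math.exp(-2 * m))   # binomial negative log-likelihood loss (8)
def loss_sqr(m): return (1 - m) ** 2                     # squared-error loss (9), using y in {-1, 1}

# All four vanish or shrink for confidently correct predictions (large positive margin).
assert loss_01(2.0) == 0.0 and loss_exp(2.0) < loss_exp(0.0)
# Exponential and log-likelihood losses decrease monotonically in the margin...
assert loss_exp(2.0) < loss_exp(1.0) and loss_log(2.0) < loss_log(1.0)
# ...but the squared-error loss increases again for margins beyond 1.
assert loss_sqr(2.0) > loss_sqr(1.0)
```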

The population minimizers for these loss functions are as follows:

F*(x) = arg min_{F(x)} E_{y|x} L_exp(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) ),
F*(x) = arg min_{F(x)} E_{y|x} L_log(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) ),
F*(x) = arg min_{F(x)} E_{y|x} L_sqr(y, F(x)) = 2 P(y = 1|x) − 1.

From these formulas, it is easy to obtain the values of P(y = 1|x).

3 Confirmation Measures

There are two confirmation measures of particular interest:

l(H, E) = log( P(E|H) / P(E|¬H) ),  (10)

f(H, E) = ( P(E|H) − P(E|¬H) ) / ( P(E|H) + P(E|¬H) ),  (11)

where H is a hypothesis and E is evidence. Measures l and f satisfy two desired properties: hypothesis symmetry, c(H, E) = −c(¬H, E) (for details, see for example [5]), and the monotonicity property M defined in terms of rough set confirmation measures (for details, see [7]). Let us remark that in the binary classification problem one tries, for a given x, to predict the value y ∈ {−1, 1}. In this case the evidence is x, and the hypotheses are y = 1 and y = −1. Confirmation measures (10) and (11) then take the following form:

l(y = 1, x) = log( P(x|y = 1) / P(x|y = −1) ),  (12)

f(y = 1, x) = ( P(x|y = 1) − P(x|y = −1) ) / ( P(x|y = 1) + P(x|y = −1) ).  (13)

4 Population Minimizers for Weighted Loss Functions

In this section, we present our main results, which show the relationship between loss functions and confirmation measures. We prove that population minimizers for weighted loss functions correspond to confirmation measures. Weighted loss functions are often used in the case of imbalanced class distribution, i.e., when the probabilities P(y = 1) and P(y = −1) differ substantially. A weighted loss function can be defined as follows:

L_w(y, F(x)) = w · L(y, F(x)),

where L(y, F(x)) is one of the loss functions presented above. Assuming that P(y) is known, one can take w = 1/P(y), and then:

L_w(y, F(x)) = (1/P(y)) L(y, F(x)).  (14)

In the proofs presented below, we use the following well-known facts: Bayes' theorem, P(y = 1|x) = P(x|y = 1) P(y = 1) / P(x); and the identities P(y = −1) = 1 − P(y = 1) and P(y = −1|x) = 1 − P(y = 1|x).

Let us consider the following weighted 0-1 loss function:

L^w_{0-1}(y, F(x)) = (1/P(y)) · { 0 if yF(x) > 0,  1 if yF(x) ≤ 0. }  (15)

Theorem 1. The population minimizer of E_{y|x} L^w_{0-1}(y, F(x)) is:

F*(x) = { 1 if P(y = 1|x) > P(y = 1),  −1 if P(y = 1|x) ≤ P(y = 1) } = { 1 if c(y = 1, x) > 0,  −1 if c(y = 1, x) ≤ 0 },  (16)

where c is any confirmation measure.

Proof. We have that F*(x) = arg min_{F(x)} E_{y|x} L^w_{0-1}(y, F(x)). The prediction risk is then:

E_{y|x} L^w_{0-1}(y, F(x)) = P(y = 1|x) L^w_{0-1}(1, F(x)) + P(y = −1|x) L^w_{0-1}(−1, F(x))
= ( P(y = 1|x) / P(y = 1) ) L_{0-1}(1, F(x)) + ( P(y = −1|x) / P(y = −1) ) L_{0-1}(−1, F(x)).

This is minimized by taking any F(x) < 0 if P(y = 1|x)/P(y = 1) ≤ P(y = −1|x)/P(y = −1), and any F(x) > 0 otherwise (in other words, only the sign of F(x) matters). From P(y = 1|x)/P(y = 1) ≤ P(y = −1|x)/P(y = −1), we have that:

P(y = 1|x) / P(y = 1) ≤ ( 1 − P(y = 1|x) ) / ( 1 − P(y = 1) ),

which finally gives P(y = 1|x) ≤ P(y = 1), or c(y = 1, x) ≤ 0. Analogously, from P(y = 1|x)/P(y = 1) > P(y = −1|x)/P(y = −1), we obtain P(y = 1|x) > P(y = 1), or c(y = 1, x) > 0. From the above we get the thesis. □

From the above theorem, it is easy to see that minimization of L^w_{0-1}(y, F(x)) results in estimation of the region of the predictor space in which class Cl_1 is observed with positive confirmation. In the following theorems, we show that

minimization of a weighted version of the exponential, the binomial negative log-likelihood, and the squared-error loss function gives an estimate of a particular confirmation measure, l or f.

Let us consider the following weighted exponential loss function:

L^w_exp(y, F(x)) = (1/P(y)) exp(−yF(x)).  (17)

Theorem 2. The population minimizer of E_{y|x} L^w_exp(y, F(x)) is:

F*(x) = (1/2) log( P(x|y = 1) / P(x|y = −1) ) = (1/2) l(y = 1, x).  (18)

Proof. We have that F*(x) = arg min_{F(x)} E_{y|x} L^w_exp(y, F(x)). The prediction risk is then:

E_{y|x} L^w_exp(y, F(x)) = P(y = 1|x) L^w_exp(1, F(x)) + P(y = −1|x) L^w_exp(−1, F(x))
= ( P(y = 1|x) / P(y = 1) ) exp(−F(x)) + ( P(y = −1|x) / P(y = −1) ) exp(F(x)).

Let us compute the derivative of the above expression:

∂ E_{y|x} L^w_exp(y, F(x)) / ∂F(x) = −( P(y = 1|x) / P(y = 1) ) exp(−F(x)) + ( P(y = −1|x) / P(y = −1) ) exp(F(x)).

Setting the derivative to zero, we get:

exp(2F(x)) = ( P(y = 1|x) P(y = −1) ) / ( P(y = −1|x) P(y = 1) ),

F*(x) = (1/2) log( ( P(y = 1|x) P(y = −1) ) / ( P(y = −1|x) P(y = 1) ) ) = (1/2) log( P(x|y = 1) / P(x|y = −1) ) = (1/2) l(y = 1, x). □

Let us consider the following weighted binomial negative log-likelihood loss function:

L^w_log(y, F(x)) = (1/P(y)) log(1 + exp(−2yF(x))).  (19)

Theorem 3. The population minimizer of E_{y|x} L^w_log(y, F(x)) is:

F*(x) = (1/2) log( P(x|y = 1) / P(x|y = −1) ) = (1/2) l(y = 1, x).  (20)
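Theorems 1-3 can be checked numerically before going through the proofs. For assumed values of P(y = 1|x) and P(y = 1) (chosen here only for illustration), the weighted 0-1 risk is minimized by the sign of P(y = 1|x) − P(y = 1), and grid minimization of the weighted exponential and log-likelihood risks recovers (1/2) l(y = 1, x). A sketch:

```python
import math

p_pos, prior = 0.85, 0.9   # illustrative P(y=1|x) and P(y=1), an imbalanced case

def w_risk_01(f):
    """E_{y|x} L^w_{0-1}: class-conditional 0-1 losses weighted by 1/P(y)."""
    return (p_pos / prior) * (0.0 if f > 0 else 1.0) \
         + ((1 - p_pos) / (1 - prior)) * (1.0 if f > 0 else 0.0)

def w_risk_exp(f):
    """E_{y|x} L^w_exp: weighted exponential risk from the proof of Theorem 2."""
    return (p_pos / prior) * math.exp(-f) + ((1 - p_pos) / (1 - prior)) * math.exp(f)

def w_risk_log(f):
    """E_{y|x} L^w_log: weighted log-likelihood risk from the proof of Theorem 3."""
    return (p_pos / prior) * math.log(1 + math.exp(-2 * f)) \
         + ((1 - p_pos) / (1 - prior)) * math.log(1 + math.exp(2 * f))

# Theorem 1: the sign of the minimizer matches the sign of P(y=1|x) - P(y=1).
best_sign = 1 if w_risk_01(1.0) < w_risk_01(-1.0) else -1
assert best_sign == (1 if p_pos > prior else -1)

# Theorems 2 and 3: grid search over F recovers (1/2) * l(y=1, x).
grid = [i / 1000.0 for i in range(-5000, 5001)]
l_measure = math.log((p_pos / (1 - p_pos)) * ((1 - prior) / prior))  # l(y=1,x) via Bayes' theorem
assert abs(min(grid, key=w_risk_exp) - 0.5 * l_measure) < 1e-3
assert abs(min(grid, key=w_risk_log) - 0.5 * l_measure) < 1e-3
```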

Proof. We have that F*(x) = arg min_{F(x)} E_{y|x} L^w_log(y, F(x)). The prediction risk is then:

E_{y|x} L^w_log(y, F(x)) = P(y = 1|x) L^w_log(1, F(x)) + P(y = −1|x) L^w_log(−1, F(x))
= ( P(y = 1|x) / P(y = 1) ) log(1 + exp(−2F(x))) + ( P(y = −1|x) / P(y = −1) ) log(1 + exp(2F(x))).

Let us compute the derivative of the above expression:

∂ E_{y|x} L^w_log(y, F(x)) / ∂F(x) = −2 ( P(y = 1|x) / P(y = 1) ) exp(−2F(x)) / (1 + exp(−2F(x))) + 2 ( P(y = −1|x) / P(y = −1) ) exp(2F(x)) / (1 + exp(2F(x))).

Setting the derivative to zero, we get:

exp(2F(x)) = ( P(y = 1|x) P(y = −1) ) / ( P(y = −1|x) P(y = 1) ),

F*(x) = (1/2) log( ( P(y = 1|x) P(y = −1) ) / ( P(y = −1|x) P(y = 1) ) ) = (1/2) l(y = 1, x). □

Let us consider the following weighted squared-error loss function:

L^w_sqr(y, F(x)) = (1/P(y)) (y − F(x))^2.  (21)

Theorem 4. The population minimizer of E_{y|x} L^w_sqr(y, F(x)) is:

F*(x) = ( P(x|y = 1) − P(x|y = −1) ) / ( P(x|y = 1) + P(x|y = −1) ) = f(y = 1, x).  (22)

Proof. We have that F*(x) = arg min_{F(x)} E_{y|x} L^w_sqr(y, F(x)). The prediction risk is then:

E_{y|x} L^w_sqr(y, F(x)) = P(y = 1|x) L^w_sqr(1, F(x)) + P(y = −1|x) L^w_sqr(−1, F(x))
= ( P(y = 1|x) / P(y = 1) ) (1 − F(x))^2 + ( P(y = −1|x) / P(y = −1) ) (1 + F(x))^2.

Let us compute the derivative of the above expression:

∂ E_{y|x} L^w_sqr(y, F(x)) / ∂F(x) = −2 ( P(y = 1|x) / P(y = 1) ) (1 − F(x)) + 2 ( P(y = −1|x) / P(y = −1) ) (1 + F(x)).

Setting the derivative to zero, we get:

F*(x) = ( P(y = 1|x)/P(y = 1) − P(y = −1|x)/P(y = −1) ) / ( P(y = 1|x)/P(y = 1) + P(y = −1|x)/P(y = −1) )
= ( P(x|y = 1) − P(x|y = −1) ) / ( P(x|y = 1) + P(x|y = −1) ) = f(y = 1, x). □

5 Conclusions

We have proven that population minimizers for weighted loss functions correspond directly to confirmation measures. This result can be applied in the construction of machine learning methods, for example, ensemble classifiers producing a linear combination of base classifiers. In particular, considering an ensemble of decision rules [1,2], the sum of the outputs of the rules covering x can be interpreted as an estimate of a confirmation measure for x and the predicted class. Our future research will concern the investigation of the general conditions that a loss function has to satisfy in order to be used in the estimation of confirmation measures.

References

1. Błaszczyński, J., Dembczyński, K., Kotłowski, W., Słowiński, R., Szeląg, M.: Ensemble of Decision Rules. Research Report RA-011/06, Poznań University of Technology (2006)
2. Błaszczyński, J., Dembczyński, K., Kotłowski, W., Słowiński, R., Szeląg, M.: Ensembles of Decision Rules for Solving Binary Classification Problems with Presence of Missing Values. In: Greco, S., et al. (eds.): Rough Sets and Current Trends in Computing. Lecture Notes in Artificial Intelligence 4259, Springer-Verlag (2006) 318-327
3. Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29(5) (2001) 1189-1232
4. Friedman, J. H.: Recent Advances in Predictive (Machine) Learning. Dept. of Statistics, Stanford University, http://www-stat.stanford.edu/~jhf (2003)
5. Fitelson, B.: Studies in Bayesian Confirmation Theory. Ph.D. Thesis, University of Wisconsin, Madison (2001)
6. Freund, Y., Schapire, R. E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 (1997) 119-139
7. Greco, S., Pawlak, Z., Słowiński, R.: Can Bayesian confirmation measures be useful for rough set decision rules? Engineering Applications of Artificial Intelligence 17 (2004) 345-361
8. Hastie, T., Tibshirani, R., Friedman, J. H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer (2003)
9. Kyburg, H.: Recent work in inductive logic. In: Lucey, K.G., Machan, T.R. (eds.): Recent Work in Philosophy. Rowman and Allanheld, Totowa, NJ (1983) 89-150
10. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, http://www.r-project.org (2005)