On distance measures, surrogate loss functions, and distributed detection


XuanLong Nguyen
Computer Science Division
University of California, Berkeley

Martin J. Wainwright
EECS Department and Statistics Department
University of California, Berkeley

Michael I. Jordan
EECS Department and Statistics Department
University of California, Berkeley

September 12, 2005
Technical Report, Department of Statistics, University of California, Berkeley

Abstract

In this paper, we show the correspondence between distance measures and surrogate loss functions in the context of decentralized binary hypothesis testing. This correspondence helps explicate the use of various distance measures in signal processing and quantization theory, as well as explain the behavior of surrogate loss functions often used in machine learning and statistics. We then develop a notion of equivalence among distance measures, and among loss functions. Finally, we investigate the statistical behavior of a nonparametric decentralized hypothesis testing algorithm that minimizes convex surrogate loss functions equivalent to the 0-1 loss.

1 Introduction

Discriminant analysis has undergone significant and sustained development over several decades in various engineering fields, where elaborations of the basic paradigm have been developed that are responsive to particular constellations of physical, informational and computational constraints. For example, research in the area of distributed detection focuses on problems in which measurements are obtained by physically-distributed devices which, due to power and bandwidth limitations, send quantized versions of their measurements to a central site where detection decisions are made (Tsitsiklis, 1993b, Blum et al., 1997). This problem is of significant current interest in the field of sensor networks (e.g., Chamberland and Veeravalli, 2003). Similar problems known as signal selection problems also blend discriminant analysis with aspects of experimental design (Kailath, 1967).

Such problems are generally formulated as hypothesis-testing problems, either within a Neyman-Pearson or Bayesian framework. Unfortunately, these formulations rarely lead to computationally tractable algorithms, and much of the focus has been on defining surrogates for the probability of error that lead to practical algorithms. For example, the Hellinger distance has been championed for distributed detection problems, due to the fact that it yields a tractable algorithm both for the experimental design aspect of the problem (the choice of quantization rules) and for the discriminant analysis aspect of the problem (Longo et al., 1990). More broadly, a class of functions known as Ali-Silvey distances or f-divergences, which includes the Hellinger distance as well as the variational distance, the Kullback-Leibler (KL) divergence and the Chernoff distance, has been explored as a source of tractable approximations to the probability of error in a wide variety of applied discrimination problems (Ali and Silvey, 1966, Csiszár, 1967).

Theoretical support for the use of f-divergences in discrimination problems comes from two main sources. First, a classical result of Blackwell (1951) establishes that if procedure A has a smaller f-divergence than procedure B (for some particular f-divergence), then there must exist some set of prior probabilities such that procedure A has a smaller probability of error than procedure B. This is a weak justification, but it has proved useful in designing signal selection and quantization rules (Kailath, 1967, Poor and Thomas, 1977, Longo et al., 1990). Second, f-divergences often arise as exponents in asymptotic characterizations of the optimal rate of convergence in hypothesis-testing problems; examples include the KL divergence (for the Neyman-Pearson problem) and the Chernoff distance (for the Bayesian formulation).

A parallel, more recent, line of research in the field of statistical machine learning has also focused on computationally-motivated surrogate functions in discriminant analysis. In statistical machine learning, the formulation of discrimination problems (also known as classification problems) is decision-theoretic, with the Bayes error interpreted as risk under a 0-1 loss, with the algorithmic goal being that of minimizing the empirical expectation of the 0-1 loss, and with empirical process theory providing the underlying framework for theoretical analysis. In this setting, the nonconvexity of the 0-1 loss is viewed as the source of the intractability of minimizing the probability of error, and researchers have studied algorithms based on replacing the 0-1 loss with surrogate loss functions that are convex upper bounds on the 0-1 loss (see Figure 1). A wide variety of practically successful machine learning algorithms are based on this tactic, including the support vector machine (Schölkopf and Smola, 2002), AdaBoost (Freund and Schapire, 1997), the X4 method (Breiman, 1998) and logistic regression (Friedman et al., 2000). Theoretical support for this line of research comes from the results of Bartlett et al. (2005), Zhang (2004), and others, who have provided characterizations of the class of surrogate loss functions in terms of the consistency of the resulting estimation procedures, and have shown how the rate of convergence to the Bayes optimal risk depends on properties of the surrogate loss functions.
The f-divergences studied in information theory and the surrogate loss functions studied in statistical machine learning are different mathematical objects: the former are functions on pairs of measures, while the latter are functions on values of discriminant functions and class labels. Nonetheless, their shared role in obtaining computationally tractable algorithms for discriminant analysis suggests that they should be related. Indeed, Blackwell's result hints at such a relationship, but its focus on the 0-1 loss does not lend itself to developing relationships between specific f-divergences and specific surrogate loss functions. In the current paper we analyze the relationship between f-divergences and surrogate loss functions in detail, presenting a full characterization of the connection. We show that for any expected surrogate loss, regardless of its convexity, there exists a corresponding convex $f$ such that minimizing the expected loss is equivalent to maximizing the f-divergence. We also provide necessary and sufficient conditions for an f-divergence to be realized by some (decreasing) convex loss function. More precisely, given a convex $f$, we provide a constructive procedure to generate all decreasing convex loss functions for which the correspondence holds.

The relationship is suggested in Figure 1; note in particular that there are in general many loss functions that correspond to the same f-divergence.

Figure 1. Illustration of the correspondence between f-divergence measures and loss functions. For each loss function $\phi$, there exists exactly one corresponding f-divergence (for some convex $f$) such that the $\phi$-risk is equal to the negative f-divergence. Conversely, for each f-divergence, there exists a whole set of loss functions $\phi$ for which the correspondence holds. Within the class of convex loss functions and the class of f-divergence measures, one can construct equivalent loss functions and equivalent f-divergence measures, respectively. For the class of classification-calibrated decreasing convex loss functions, we can characterize the correspondence precisely.

As examples of the general correspondence that we establish in this paper, we show that the hinge loss corresponds to the variational distance, the exponential loss corresponds to the Hellinger distance, and the logistic loss corresponds to the capacitory discrimination distance.

Besides the intrinsic interest of these results as an extension of Blackwell's result, and the general cross-fertilization that they permit between results in information theory and results in statistical machine learning, there are several specific consequences of our general theoretical development. First, there are numerous useful inequalities relating the various f-divergences (Topsoe, 2000); our theorem allows these inequalities to be exploited in the analysis of loss functions. Second, the minimizer of the Bayes error and the maximizer of the f-divergence are both known to possess certain extremal properties (Tsitsiklis, 1993a); our theorem allows these properties to be connected. Third, our theorem allows a notion of equivalence to be defined among loss functions: loss functions are equivalent if they induce the same f-divergence. We specifically use the constructive nature of our theorem to exhibit all possible convex loss functions that are equivalent to the 0-1 loss.

To illustrate our general theoretical result, we present an application to the problem of distributed detection. Rather than approaching the problem via the classical route of f-divergences, we instead approach the problem using the tools of statistical machine learning. We obtain a novel algorithmic framework for distributed detection for which we can prove strong convergence results. In particular, exploiting the equivalence alluded to above, we can show that for any surrogate loss function equivalent to the 0-1 loss, our estimation procedure is consistent in the strong sense that it will asymptotically choose Bayes-optimal quantization rules.

The paper is organized as follows. In Section 2, we define a version of discriminant analysis that is suitably general so as to include problems such as distributed detection and signal selection which involve an aspect of experiment design. We also provide a formal definition of surrogate loss functions and present examples of optimized risks based on these loss functions. In Section 3, we state and prove the correspondence theorem between surrogate loss functions and f-divergences. Section 4 illustrates the correspondence using well-known examples of loss functions and their f-divergence counterparts.
In Section 5, we discuss connections between the choice of quantization schemes and Blackwell's classic results on comparisons

of experiments. Then we introduce notions of equivalence between loss functions (and f-divergences) and explore their properties. In Section 6, we establish the consistency of schemes for choosing Bayes-optimal classifiers based on surrogate loss functions that are equivalent to the 0-1 loss. We present our conclusions in Section 7.

2 Background and elementary results

Consider a covariate $X \in \mathcal{X}$, where $\mathcal{X}$ is a compact topological space, and a random variable $Y \in \mathcal{Y} := \{-1, +1\}$. The space $\mathcal{X} \times \mathcal{Y}$ is assumed to be endowed with a Borel regular probability measure $P$. In the classical discrimination (i.e., binary classification) problem, the goal is to find a discriminant function based on i.i.d. samples from $P$. In this paper, we consider an elaboration of this problem in which the decision-maker, rather than having direct access to $X$, observes a variable $Z \in \mathcal{Z}$ that is obtained via a (possibly stochastic) mapping $Q : \mathcal{X} \to \mathcal{Z}$. The mapping $Q$ is referred to as an experiment in statistics; in the signal processing literature, where $\mathcal{Z}$ is generally taken to be discrete, it is referred to as a quantizer. We let $\mathcal{Q}$ denote the space of all stochastic $Q$, and let $\mathcal{Q}_0$ denote its deterministic subset. Given a fixed experiment $Q$, we formulate a binary classification problem as the problem of finding a measurable function $\gamma \in \Gamma := \{\mathcal{Z} \to \mathbb{R}\}$ that minimizes the Bayes risk $P(Y \neq \mathrm{sign}(\gamma(Z)))$. We are also interested in the broader question of determining both the classifier $\gamma \in \Gamma$ and the experiment choice $Q \in \mathcal{Q}$ so as to minimize the Bayes risk.

2.1 Surrogate loss functions

The Bayes risk corresponds to the expectation of the 0-1 loss $\phi(y, \gamma(z)) = \mathbb{I}[y \neq \mathrm{sign}(\gamma(z))]$. Given the nonconvexity of this loss function, it is natural to consider a surrogate loss function $\phi$ that we optimize in place of the 0-1 loss. In particular, we focus on loss functions of the form $\phi(y, \gamma(z)) = \phi(y\gamma(z))$, where $\phi : \mathbb{R} \to \mathbb{R}$ is a convex upper bound on the 0-1 loss. The quantity $y\gamma(z)$ is known as the margin, and $\phi(y\gamma(z))$ is often referred to as a margin-based loss function. Given a particular loss function $\phi$, we denote the associated $\phi$-risk by $R_\phi(\gamma, Q) := \mathbb{E}\,\phi(Y\gamma(Z))$.

The following are examples of loss functions used in the statistical machine learning literature. The hinge loss function is used in the support vector machine (SVM) algorithm (Schölkopf and Smola, 2002):
$$\phi_{\mathrm{hinge}}(y\gamma(z)) := \max\{1 - y\gamma(z),\ 0\}. \tag{1}$$
The logistic loss function is used in logistic regression (Friedman et al., 2000):
$$\phi_{\mathrm{log}}(y\gamma(z)) := \log\big(1 + \exp(-y\gamma(z))\big). \tag{2}$$
The AdaBoost algorithm (Freund and Schapire, 1997) uses an exponential loss function:
$$\phi_{\mathrm{exp}}(y\gamma(z)) := \exp(-y\gamma(z)). \tag{3}$$
Finally, we also consider the least squares function:
$$\phi_{\mathrm{sqr}}(y\gamma(z)) := (1 - y\gamma(z))^2. \tag{4}$$
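As a quick computational illustration (our own sketch, not part of the original report), the following Python snippet tabulates these four losses as functions of the margin and checks the two properties used repeatedly below: convexity, and classification calibration in the sense of Lemma 2 (differentiability at 0 with $\phi'(0) < 0$):

```python
import numpy as np

# The four margin-based surrogate losses of equations (1)-(4), written as
# functions of the margin m = y * gamma(z).
losses = {
    "hinge":         lambda m: np.maximum(1.0 - m, 0.0),  # eq. (1)
    "logistic":      lambda m: np.log1p(np.exp(-m)),      # eq. (2)
    "exponential":   lambda m: np.exp(-m),                # eq. (3)
    "least squares": lambda m: (1.0 - m) ** 2,            # eq. (4)
}

m = np.linspace(-3.0, 3.0, 601)
h = 1e-6
for name, phi in losses.items():
    # Convexity: discrete second differences must be non-negative.
    assert np.all(np.diff(phi(m), 2) >= -1e-9), name
    # Classification calibration (Lemma 2): phi'(0) < 0.
    assert (phi(h) - phi(-h)) / (2 * h) < 0, name
print("all four losses are convex and classification-calibrated")
```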

Bartlett et al. (2005) have provided a general definition of surrogate loss functions. Their definition is crafted so as to permit the derivation of a general bound that links the $\phi$-risk and the Bayes risk, thereby permitting an elegant general treatment of the consistency of estimation procedures based on surrogate losses. The definition is essentially a pointwise form of a Fisher consistency condition that is appropriate for the classification setting. In particular, we have:

Definition 1. A convex loss function $\phi$ is classification-calibrated if for any $a, b \geq 0$ with $a \neq b$,
$$\inf_{\alpha:\ \alpha(a-b) < 0} \big[\phi(\alpha)a + \phi(-\alpha)b\big] > \inf_{\alpha \in \mathbb{R}} \big[\phi(\alpha)a + \phi(-\alpha)b\big].$$

To obtain intuition for this definition, recall the representation of the $\phi$-risk given in equation (7): it implies that, given a fixed $Q$, the optimal $\gamma(z)$ takes a value $\alpha$ that minimizes $\phi(\alpha)\mu(z) + \phi(-\alpha)\pi(z)$. In order for the decision rule $\gamma$ to behave equivalently to the Bayes decision rule, we require that the optimal value of $\alpha$ (which defines $\gamma(z)$) have the same sign as the Bayes decision rule $\mathrm{sign}(P(Y = 1 \mid z) - P(Y = -1 \mid z)) = \mathrm{sign}(\mu(z) - \pi(z))$.

For our purposes we will find it useful to consider a somewhat stronger definition of surrogate loss functions. In particular, we will make the following three assumptions:

A1: $\phi$ is classification-calibrated.

A2: $\phi : \mathbb{R} \to \mathbb{R}$ is continuous and convex.

A3: Let $\alpha^* = \inf\{\alpha :\ \phi(\alpha) = \inf\phi\}$. If $\alpha^* < +\infty$, then for any $\epsilon > 0$,
$$\phi(\alpha^* - \epsilon) \geq \phi(\alpha^* + \epsilon). \tag{5}$$

The interpretation of Assumption A3 is that one should penalize deviations away from $\alpha^*$ in the negative direction at least as strongly as deviations in the positive direction; this requirement is intuitively reasonable given the margin-based interpretation of $\alpha^*$. This assumption is satisfied by all of the loss functions commonly considered in the literature; in particular, any decreasing function $\phi$ satisfies this condition, as does the least squares loss (which is not decreasing).

Bartlett et al. (2005) also presented a simple lemma that we will find useful:

Lemma 2. Let $\phi$ be a convex function. Then $\phi$ is classification-calibrated if and only if it is differentiable at 0 and $\phi'(0) < 0$.

This lemma implies that Assumption A1 is equivalent to requiring that $\phi$ be differentiable at 0 with $\phi'(0) < 0$. These facts also imply that $\alpha^* > 0$, where $\alpha^*$ is defined in Assumption A3. Finally, although $\phi$ is not defined for $-\infty$, we shall use the convention that $\phi(-\infty) = +\infty$.

2.2 Examples of optimum $\phi$-risks

For each fixed quantization rule $Q$, we define the optimal $\phi$-risk (a function of $Q$) as follows:
$$R_\phi(Q) := \inf_{\gamma \in \Gamma} R_\phi(\gamma, Q). \tag{6}$$
Given priors $p = P(Y = 1)$ and $q = P(Y = -1)$ on the hypothesis space (where $p, q > 0$ and $p + q = 1$), we define positive measures $\mu$ and $\pi$ over $\mathcal{Z}$:
$$\mu(z) = P(Y = 1, z) = p\int_x Q_z(x)\, dP(x \mid Y = 1),$$
$$\pi(z) = P(Y = -1, z) = q\int_x Q_z(x)\, dP(x \mid Y = -1).$$

As a consequence of Lyapunov's theorem, the space of pairs $\{(\mu, \pi)\}$ obtained by varying $Q \in \mathcal{Q}$ (or $\mathcal{Q}_0$) is both compact and convex (cf. Tsitsiklis, 1993a). We will find the following lemma to be useful:

Lemma 3. For each fixed fusion decision rule $\gamma$,
$$\inf_{Q \in \mathcal{Q}} R_\phi(\gamma, Q) = \min_{Q \in \mathcal{Q}_0} R_\phi(\gamma, Q).$$

Proof. See Appendix A.

For simplicity, in this paper we assume that the spaces $\mathcal{Q}$ and $\mathcal{Q}_0$ are restricted such that both $\mu$ and $\pi$ are strictly positive measures. Note that the measures $\mu$ and $\pi$ are constrained by the following simple relations:
$$\mu(z) + \pi(z) = P(z) \ \text{for each } z \in \mathcal{Z}, \qquad \sum_{z \in \mathcal{Z}} \mu(z) = P(Y = 1), \qquad \sum_{z \in \mathcal{Z}} \pi(z) = P(Y = -1), \qquad \sum_{z \in \mathcal{Z}} \big[\mu(z) + \pi(z)\big] = 1.$$

Let $\eta(x) = P(Y = 1 \mid x)$. Note that $Y$ and $Z$ are independent conditioned on $X$. Therefore, we can write
$$R_\phi(\gamma, Q) = \mathbb{E}_X \sum_z \Big[\phi(\gamma(z))\,\eta(X)\,Q(z \mid X) + \phi(-\gamma(z))\,(1 - \eta(X))\,Q(z \mid X)\Big]. \tag{7}$$
On the basis of this equation, the $\phi$-risk can be written in the following way:
$$R_\phi(\gamma, Q) = \mathbb{E}\,\phi(Y\gamma(Z)) = \sum_z \Big[\phi(\gamma(z))\,\mathbb{E}_X \eta(X)Q(z \mid X) + \phi(-\gamma(z))\,\mathbb{E}_X (1 - \eta(X))Q(z \mid X)\Big] \tag{8}$$
$$= \sum_z \Big[\phi(\gamma(z))\mu(z) + \phi(-\gamma(z))\pi(z)\Big]. \tag{9}$$
This representation allows us to compute the optimal value of $\gamma(z)$ for all $z \in \mathcal{Z}$, as well as the optimal $\phi$-risk for a fixed $Q$. We illustrate with some examples.

0-1 loss. If $\phi$ is the 0-1 loss, then $\gamma(z) = \mathrm{sign}(\mu(z) - \pi(z))$. As a result, the optimal Bayes risk given a fixed $Q$ takes the form
$$R_{\mathrm{bayes}}(Q) = \sum_{z \in \mathcal{Z}} \min\{\mu(z), \pi(z)\} = \frac{1}{2} - \frac{1}{2}\sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)| = \frac{1}{2}\big(1 - V(\mu, \pi)\big),$$
where $V(\mu, \pi)$ denotes the variational distance between the two measures $\mu$ and $\pi$: $V(\mu, \pi) := \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)|$.

Hinge loss. If $\phi$ is the hinge loss, then again $\gamma(z) = \mathrm{sign}(\mu(z) - \pi(z))$. As a result, the optimal risk for the hinge loss takes the form
$$R_{\mathrm{hinge}}(Q) = \sum_{z \in \mathcal{Z}} 2\min\{\mu(z), \pi(z)\} = 1 - \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)| = 1 - V(\mu, \pi) = 2R_{\mathrm{bayes}}(Q).$$
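These two identities are easy to confirm numerically. The following sketch (ours; the strictly positive measures $\mu, \pi$ are an arbitrary random choice) minimizes the inner expression of equation (9) over $\gamma(z)$ on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                                  # |Z|, the number of quantizer outputs
w = rng.random(2 * k) + 0.05
w /= w.sum()                           # total mass mu + pi equals 1
mu, pi = w[:k], w[k:]

alphas = np.linspace(-5.0, 5.0, 2001)  # grid over the fusion rule gamma(z)
hinge = lambda a: np.maximum(1.0 - a, 0.0)

# R_phi(Q) = sum_z min_alpha [ phi(alpha) mu(z) + phi(-alpha) pi(z) ]  (eq. 9)
r_hinge = sum((hinge(alphas) * m + hinge(-alphas) * p).min() for m, p in zip(mu, pi))

V = np.abs(mu - pi).sum()              # variational distance V(mu, pi)
r_bayes = np.minimum(mu, pi).sum()     # optimal 0-1 risk

assert np.isclose(r_bayes, 0.5 * (1.0 - V))
assert np.isclose(r_hinge, 1.0 - V)
print(f"R_bayes = {r_bayes:.4f} = (1 - V)/2, and R_hinge = 2 R_bayes")
```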

Least squares loss. If $\phi$ is the least squares loss, then $\gamma(z) = \frac{\mu(z) - \pi(z)}{\mu(z) + \pi(z)}$. The optimal risk for the least squares loss takes the form
$$R_{\mathrm{sqr}}(Q) = \sum_{z \in \mathcal{Z}} \frac{4\mu(z)\pi(z)}{\mu(z) + \pi(z)} = 1 - \sum_{z \in \mathcal{Z}} \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)} = 1 - \Delta(\mu, \pi),$$
where $\Delta(\mu, \pi)$ denotes the triangular discrimination distance: $\Delta(\mu, \pi) := \sum_{z \in \mathcal{Z}} \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)}$.

Logistic loss. If $\phi$ is the logistic loss, then $\gamma(z) = \log\frac{\mu(z)}{\pi(z)}$. As a result, the optimal risk for the logistic loss takes the form
$$R_{\mathrm{log}}(Q) = \sum_{z \in \mathcal{Z}} \Big[\mu(z)\log\frac{\mu(z) + \pi(z)}{\mu(z)} + \pi(z)\log\frac{\mu(z) + \pi(z)}{\pi(z)}\Big] = \log 2 - KL\Big(\mu \,\Big\|\, \frac{\mu + \pi}{2}\Big) - KL\Big(\pi \,\Big\|\, \frac{\mu + \pi}{2}\Big) = \log 2 - C(\mu, \pi),$$
where $KL(U \| V)$ denotes the Kullback-Leibler divergence between two measures $U$ and $V$, and $C(U, V)$ denotes the capacitory discrimination distance: $C(U, V) := KL\big(U \,\|\, \frac{U+V}{2}\big) + KL\big(V \,\|\, \frac{U+V}{2}\big)$.

Exponential loss. If $\phi$ is the exponential loss, then $\gamma(z) = \frac{1}{2}\log\frac{\mu(z)}{\pi(z)}$. The optimal risk for the exponential loss takes the form
$$R_{\mathrm{exp}}(Q) = \sum_{z \in \mathcal{Z}} 2\sqrt{\mu(z)\pi(z)} = 1 - \sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2 = 1 - 2h^2(\mu, \pi),$$
where $h(\mu, \pi)$ denotes the Hellinger distance between the measures $\mu$ and $\pi$: $h^2(\mu, \pi) := \frac{1}{2}\sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2$.

It is noteworthy that in all cases the optimum $\phi$-risk takes the form of a well-known distance or divergence function. It is clearly of interest to investigate the generality of this relationship.
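The remaining three identities can be checked the same way: minimizing expression (9) over $\gamma(z)$ by brute force recovers $1 - \Delta$, $\log 2 - C$, and $1 - 2h^2$. A small sketch (ours), continuing the setup above:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
w = rng.random(2 * k) + 0.05
w /= w.sum()
mu, pi = w[:k], w[k:]

a = np.linspace(-8.0, 8.0, 4001)
risk = lambda phi: sum((phi(a) * m + phi(-a) * p).min() for m, p in zip(mu, pi))

delta = np.sum((mu - pi) ** 2 / (mu + pi))            # triangular discrimination
h2 = 0.5 * np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)   # squared Hellinger distance
kl = lambda u, v: np.sum(u * np.log(u / v))
C = kl(mu, (mu + pi) / 2) + kl(pi, (mu + pi) / 2)     # capacitory discrimination

assert np.isclose(risk(lambda m: (1.0 - m) ** 2), 1.0 - delta, atol=1e-4)
assert np.isclose(risk(lambda m: np.log1p(np.exp(-m))), np.log(2.0) - C, atol=1e-4)
assert np.isclose(risk(lambda m: np.exp(-m)), 1.0 - 2.0 * h2, atol=1e-4)
print("R_sqr = 1 - Delta, R_log = log 2 - C, R_exp = 1 - 2 h^2 verified")
```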

3 The correspondence between surrogate loss functions and distance measures

In fact, the correspondence that we have seen in the examples of the previous section is quite general. The first step in revealing the generality of the connection is to consider an appropriately general notion of distance function. We consider the class of f-divergence functions, a class that includes all of the examples discussed above and numerous others (Csiszár, 1967, Ali and Silvey, 1966):

Definition 4. Given any continuous convex function $f : [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$, the f-divergence between measures $\mu$ and $\pi$ is given by
$$I_f(\mu, \pi) := \sum_z \pi(z)\, f\Big(\frac{\mu(z)}{\pi(z)}\Big).$$
For instance, the variational distance is given by $f(u) = |u - 1|$, the Kullback-Leibler divergence by $f(u) = u\log u$, the triangular discrimination by $f(u) = (u-1)^2/(u+1)$, and the Hellinger distance by $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$. Other well-known f-divergences include the (negative) Bhattacharyya distance ($f(u) = -2\sqrt{u}$) and the (negative) harmonic distance ($f(u) = -\frac{4u}{u+1}$).

As we have discussed in the introduction, these functions are widely used in the engineering literature to solve problems in distributed detection and signal selection. Specifically, given the joint distribution $P(X, Y)$ and given an experiment (e.g., quantizer) $Q$, one defines an f-divergence on the class-conditional distributions $P(Z \mid Y = 1)$ and $P(Z \mid Y = -1)$. This f-divergence is maximized with respect to $Q$. Moreover, the discriminant function $\gamma$ can generally be obtained explicitly in terms of the distributions $P(Z \mid Y = 1)$ and $P(Z \mid Y = -1)$. As we have discussed, the choice of the class of f-divergences as functions to optimize is motivated by Blackwell's classical theorem on the design of experiments, as well as by the computational intractability of minimizing the probability of error, a problem rendered particularly severe in practice when $X$ is high-dimensional (Kailath, 1967, Poor and Thomas, 1977, Longo et al., 1990).

3.1 From $\phi$-risk to f-divergence

In this section and the following one, we present a general relationship between optimal $\phi$-risks and f-divergences. The easier direction is from $\phi$-risk to f-divergence; we cover this direction in the current section. We begin with a simple result showing that any $\phi$-risk induces a corresponding f-divergence. More precisely, the following lemma proves that the optimal $\phi$-risk for a fixed $Q$ can be written as the negative of an f-divergence between $\mu$ and $\pi$.

Lemma 5. For each fixed $Q$, let $\gamma_Q$ be the optimal decision rule for the fusion center. Then the $\phi$-risk for $(Q, \gamma_Q)$ is the negative of an f-divergence between $\mu$ and $\pi$ for some convex function $f$:
$$R_\phi(Q) = -I_f(\mu, \pi). \tag{10}$$
Moreover, this holds whether $\phi$ is convex or not.

Proof. The optimal $\phi$-risk takes the form
$$R_\phi(Q) = \sum_{z \in \mathcal{Z}} \min_\alpha \big[\phi(\alpha)\mu(z) + \phi(-\alpha)\pi(z)\big] = \sum_z \pi(z)\,\min_\alpha\Big[\phi(-\alpha) + \phi(\alpha)\frac{\mu(z)}{\pi(z)}\Big].$$
For each $z$, let $u = \mu(z)/\pi(z)$; then $\min_\alpha\big(\phi(-\alpha) + \phi(\alpha)u\big)$ is a concave function of $u$ (since the minimum over a set of linear functions is concave). Thus, if we define
$$f(u) := -\min_\alpha\big(\phi(-\alpha) + \phi(\alpha)u\big), \tag{11}$$
then $f$ is convex and the claim follows. Note that this holds regardless of the convexity of $\phi$.

Remark. We can also write $I_f(\mu, \pi)$ in terms of an f-divergence between the two conditional distributions $P(Z \mid Y = 1) \sim P_1$ and $P(Z \mid Y = -1) \sim P_{-1}$. Letting $q = P(Y = -1)$ denote the prior probability, we have
$$I_f(\mu, \pi) = q\sum_z P_{-1}(z)\, f\Big(\frac{(1-q)P_1(z)}{qP_{-1}(z)}\Big) = I_{f_q}(P_1, P_{-1}),$$
where $f_q(u) := qf\big((1-q)u/q\big)$. It is equivalent to study either form of distance. We prefer the former because the prior probabilities are absorbed into the formula, but we shall return to the latter form when the connection to the general theory of comparison of experiments is discussed.
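Equation (11) can also be evaluated numerically: replacing the exact minimization over $\alpha$ by a fine grid recovers the convex functions $f$ listed above. The following is an illustrative sketch (ours), using as reference values the closed forms that are derived in Section 4:

```python
import numpy as np

alphas = np.linspace(-10.0, 10.0, 200001)   # grid stand-in for min over alpha

def f_from_loss(phi, u):
    # f(u) = -min_alpha [ phi(-alpha) + phi(alpha) u ]   (equation 11)
    return -np.min(phi(-alphas) + phi(alphas) * u)

for u in np.linspace(0.05, 5.0, 25):
    # exponential loss -> negative Bhattacharyya distance: f(u) = -2 sqrt(u)
    assert np.isclose(f_from_loss(lambda m: np.exp(-m), u),
                      -2.0 * np.sqrt(u), atol=1e-3)
    # hinge loss -> variational-distance family: f(u) = -2 min(u, 1)
    assert np.isclose(f_from_loss(lambda m: np.maximum(1.0 - m, 0.0), u),
                      -2.0 * min(u, 1.0), atol=1e-3)
    # logistic loss -> capacitory-discrimination family:
    # f(u) = u log u - (1 + u) log(1 + u)
    f_log = u * np.log(u) - (1.0 + u) * np.log(1.0 + u)
    assert np.isclose(f_from_loss(lambda m: np.log1p(np.exp(-m)), u),
                      f_log, atol=1e-3)
print("f recovered from each loss matches its closed form")
```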

3.2 From f-divergence to $\phi$-risk

In this section, we explore the converse of Lemma 5. Given a divergence $I_f(\mu, \pi)$ for some convex function $f$, does there exist a loss function $\phi$ for which $R_\phi(Q) = -I_f(\mu, \pi)$? Can we establish such a correspondence between $f$ and $\phi$, as manifested by equation (11)? In the following we show that such a correspondence indeed exists for a general class of margin-based convex loss functions, and that it is possible to construct $\phi$ given an appropriate corresponding f-divergence.

3.2.1 Some intermediate functions

Our approach to establishing the desired correspondence proceeds via some intermediate functions, which we define in this section. First, let us define, for each $\beta$, the inverse mapping
$$\phi^{-1}(\beta) := \inf\{\alpha : \phi(\alpha) \leq \beta\}, \tag{12}$$
where $\inf\emptyset := +\infty$. The following result summarizes some useful properties of $\phi^{-1}$:

Lemma 6. (a) For all $\beta \in \mathbb{R}$ such that $\phi^{-1}(\beta) < +\infty$, it holds that $\phi(\phi^{-1}(\beta)) \leq \beta$. Furthermore, equality occurs when $\phi$ is continuous at $\phi^{-1}(\beta)$.
(b) $\phi^{-1} : \mathbb{R} \to \mathbb{R}$ is a strictly decreasing convex function.

Proof. See Appendix B.

Using the function $\phi^{-1}$, we then define a new function $\Psi : \mathbb{R} \to \mathbb{R}$ by
$$\Psi(\beta) := \begin{cases} \phi(-\phi^{-1}(\beta)) & \text{if } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty & \text{otherwise.} \end{cases} \tag{13}$$
Note that the domain of $\Psi$ is $\mathrm{Dom}(\Psi) = \{\beta \in \mathbb{R} : \phi^{-1}(\beta) \in \mathbb{R}\}$. Several important facts about $\Psi$ are stated in the following lemma. Define
$$\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\} \quad \text{and} \quad \beta_2 := \inf\{\beta : \Psi(\beta) = \inf\Psi\}. \tag{14}$$
It is simple to check that $\inf\phi = \inf\Psi = \phi(\alpha^*)$, that $\beta_1 = \phi(\alpha^*)$ and $\beta_2 = \phi(-\alpha^*)$, and furthermore that $\Psi(\beta_2) = \phi(\alpha^*) = \beta_1$ and $\Psi(\beta_1) = \phi(-\alpha^*) = \beta_2$.

Lemma 7. (a) $\Psi$ is strictly decreasing in $(\beta_1, \beta_2)$. If $\phi$ is decreasing, then $\Psi$ is also decreasing in $(-\infty, +\infty)$. In addition, $\Psi(\beta) = +\infty$ for $\beta < \beta_1$.
(b) $\Psi$ is convex in $(-\infty, \beta_2]$. If $\phi$ is a decreasing function, then $\Psi$ is convex in $(-\infty, +\infty)$.
(c) $\Psi$ is lower semi-continuous, and continuous in its domain.
(d) For any $\alpha \geq 0$, $\phi(\alpha) = \Psi(\phi(-\alpha))$. In particular, there exists $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$.
(e) There holds $\Psi(\Psi(\beta)) \leq \beta$ for all $\beta \in \mathrm{Dom}(\Psi)$. If $\phi$ is continuous on $\{\alpha : \phi(\alpha) < +\infty\}$, then $\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$.

Proof. See Appendix C.
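It may help to see these definitions in action on a familiar loss; the following worked computation (added here for illustration, and consistent with Example 1 of Section 4) takes $\phi$ to be the exponential loss:
$$\phi(\alpha) = e^{-\alpha}: \qquad \phi^{-1}(\beta) = \inf\{\alpha : e^{-\alpha} \leq \beta\} = -\log\beta \quad (\beta > 0),$$
$$\Psi(\beta) = \phi\big(-\phi^{-1}(\beta)\big) = \phi(\log\beta) = e^{-\log\beta} = \frac{1}{\beta} \quad (\beta > 0),$$
and $\Psi(\beta) = +\infty$ for $\beta \leq 0$, since there $\phi^{-1}(\beta) = \inf\emptyset = +\infty \notin \mathbb{R}$. Here $\alpha^* = +\infty$, $\beta_1 = \inf\phi = 0$, $\beta_2 = \phi(-\infty) = +\infty$, the fixed point of part (d) is $u^* = \phi(0) = 1$, and $\Psi(\Psi(\beta)) = \beta$ holds on $(\beta_1, \beta_2)$, as part (e) asserts.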

Remark. The convexity of $\Psi$ does not hold in general (i.e., when $\phi$ is not a decreasing function). For the least squares loss $\phi(\alpha) = (1 - \alpha)^2$, $\Psi$ is a convex function. For the following loss function, however, $\Psi$ is not convex:
$$\phi(\alpha) = \begin{cases} (1 - \alpha)^2 & \text{when } \alpha \leq 1, \\ 0 & \text{when } 1 \leq \alpha \leq 2, \\ \alpha - 2 & \text{otherwise.} \end{cases}$$
Indeed, we have $\Psi(9) = \phi(2) = 0$ and $\Psi(16) = \phi(3) = 1$, but $\Psi(25/2) = \phi(-1 + 5/\sqrt{2}) = -3 + 5/\sqrt{2} > \big(\Psi(9) + \Psi(16)\big)/2$.

The following lemma characterizes the connection between $\Psi$ and the f-divergence.

Lemma 8. (a) Given a loss function $\phi$, the function $f$ defined in (11) satisfies
$$f(u) = \Psi^*(-u),$$
where $\Psi^*$ denotes the conjugate dual of the function $\Psi$ defined in (13), which satisfies the properties specified in Lemma 7. If, in addition, $\phi$ is decreasing, then $\Psi(\beta) = f^*(-\beta)$.
(b) All such loss functions $\phi$ must share the following form:
$$\phi(0) = u^*, \tag{15a}$$
$$\phi(\alpha) = \Psi\big(g(\alpha + u^*)\big) \quad \text{for } \alpha > 0, \tag{15b}$$
$$\phi(-\alpha) = g(\alpha + u^*) \quad \text{for } \alpha > 0, \tag{15c}$$
where $u^* \in (\beta_1, \beta_2)$ satisfies $\Psi(u^*) = u^*$, and $g : [u^*, +\infty) \to \mathbb{R}$ is any increasing continuous convex function such that $g(u^*) = u^*$. Moreover, $g$ is differentiable at $u^*{+}$ with $g'(u^*{+}) > 0$.
(c) If $\Psi$ is differentiable at $u^*$, then $\Psi'(u^*) = -1$.

Proof. (a) From (11), we have
$$-f(u) = \inf_{\alpha \in \mathbb{R}}\big(\phi(-\alpha) + \phi(\alpha)u\big) = \inf_{\beta:\ \phi^{-1}(\beta) \in \mathbb{R};\ \alpha:\ \phi(\alpha) = \beta}\big(\phi(-\alpha) + \beta u\big).$$
For $\beta$ such that $\phi^{-1}(\beta) \in \mathbb{R}$, there might be more than one $\alpha$ such that $\phi(\alpha) = \beta$. However, assumption (5) ensures that $\alpha = \phi^{-1}(\beta)$ yields the minimum value of $\phi(-\alpha)$. Hence,
$$-f(u) = \inf_{\beta:\ \phi^{-1}(\beta) \in \mathbb{R}}\big(\phi(-\phi^{-1}(\beta)) + \beta u\big) = \inf_{\beta \in \mathbb{R}}\big(\beta u + \Psi(\beta)\big),$$
so that $f(u) = \sup_{\beta \in \mathbb{R}}\big(-\beta u - \Psi(\beta)\big) = \Psi^*(-u)$. If $\phi$ is decreasing, then $\Psi$ is convex. By convex duality and the lower semicontinuity of $\Psi$ (from Lemma 7), we can also write
$$\Psi(\beta) = \Psi^{**}(\beta) = f^*(-\beta). \tag{16}$$
Hence, we can recover the function $\Psi$ once we know $f$. In the sequel, we show how to recover the loss function $\phi$ from $\Psi$.

(b) We now show that all convex loss functions $\phi$ must have the form (15) for some $g$. We have shown in Lemma 7 that $\Psi(\phi(0)) = \phi(0) \in (\beta_1, \beta_2)$. Hence, $\phi(0)$ takes the value $u^* \in (\beta_1, \beta_2)$ for which $\Psi(u^*) = u^*$. Since $\phi$ is a decreasing convex function on $(-\infty, 0]$, for any $\alpha \geq 0$ the quantity $\phi(-\alpha)$ can be written in the form $\phi(-\alpha) = g(\alpha + u^*)$, where $g$ is some increasing convex function. Invoking Lemma 7 again, for $\alpha \geq 0$, $\phi(\alpha) = \Psi(\phi(-\alpha)) = \Psi(g(\alpha + u^*))$. To ensure continuity at 0, there must hold $u^* = \phi(0) = g(u^*)$. To ensure that $\phi$ is classification-calibrated, $\phi$ must be differentiable at 0 with $\phi'(0) < 0$; this implies that $g$ is differentiable at $u^*$ with $g'(u^*) > 0$.

(c) If $\Psi$ is differentiable at $u^*$, then at $\alpha = 0$ we can write
$$\phi'(0+) = \Psi'(u^*)\,g'(u^*) = \phi'(0-) = -g'(u^*) < 0,$$
which implies that $\Psi'(u^*) = -1$.

According to Lemma 8(a), if $\Psi$ is a lower semicontinuous convex function, it is possible to recover $\Psi$ from $f$ by means of convex duality (Rockafellar, 1970): $\Psi(\beta) = f^*(-\beta)$. Equation (13) then provides a means of recovering a loss function $\phi$ from $\Psi$. Indeed, the following theorem provides a constructive procedure for finding all such $\phi$ when $\Psi$ satisfies the necessary conditions specified in Lemma 7.

3.2.2 A converse theorem

Theorem 9. Given a lower semicontinuous convex function $f : \mathbb{R} \to \mathbb{R}$, define $\Psi(\beta) = f^*(-\beta)$, and recall that $\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\}$ and $\beta_2 := \inf\{\beta : \Psi(\beta) = \inf\Psi\}$. Suppose that $\Psi$ satisfies the following properties:
1. $\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$;
2. $\Psi$ is a decreasing function.
Then we can construct all convex continuous loss functions $\phi$ for which (10) and (11) hold; in fact, all such functions $\phi$ are of the form (15). If, in addition, $\Psi$ is differentiable at the point $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$ (such a point can be proven to exist), then all such $\phi$ are classification-calibrated.

Proof. By convex duality and the lower semicontinuity of $f$, we have
$$f(u) = f^{**}(u) = \Psi^*(-u) = \sup_{\beta \in \mathbb{R}}\big(-\beta u - \Psi(\beta)\big) = -\inf_{\beta \in \mathbb{R}}\big(\beta u + \Psi(\beta)\big).$$
Lemma 8 asserts that all convex loss functions $\phi$ for which (10) and (11) hold must have the form (15). $\Psi$ is lower semicontinuous and convex by definition. It remains to show that any convex loss function $\phi$ of the form (15) satisfies
$$\Psi(\beta) = \begin{cases} \phi(-\phi^{-1}(\beta)) & \text{when } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty & \text{otherwise.} \end{cases} \tag{17}$$
Since $\Psi$ is assumed to be a decreasing function, $\phi$ so defined is also a decreasing function. Given that $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in (\beta_1, \beta_2)$, it is simple to check that there exists $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$. For $\beta \geq u^*$, there exists $\alpha \geq 0$ such that $g(\alpha + u^*) = \beta$; choose the largest such $\alpha$. From our definition of $\phi$, $\phi(-\alpha) = \beta$, so $\phi^{-1}(\beta) = -\alpha$. It follows that $\phi(-\phi^{-1}(\beta)) = \phi(\alpha) = \Psi(g(\alpha + u^*)) = \Psi(\beta)$.

For $\beta < \beta_1 = \inf\Psi$, we have $\Psi(\beta) = +\infty$. Now consider $\beta_1 \leq \beta < u^*$ ($< \beta_2$). If there exists some $\alpha > 0$ such that $\beta = \Psi(g(\alpha + u^*))$ with $g(\alpha + u^*) \in (\beta_1, \beta_2)$, then $\beta = \phi(\alpha)$ from our definition. Choose the smallest such $\alpha$; then $\phi^{-1}(\beta) = \alpha$. It follows that $\phi(-\phi^{-1}(\beta)) = \phi(-\alpha) = g(\alpha + u^*) = \Psi\big(\Psi(g(\alpha + u^*))\big) = \Psi(\beta)$ (because $g(\alpha + u^*) \in (\beta_1, \beta_2)$). If, on the other hand, no such $\alpha$ exists, then $\beta < \inf\Psi = \inf\phi$ (thanks to the construction of $\phi$), in which case $\Psi(\beta) = +\infty$ by assumption (see Lemma 7).

Since $\Psi$ is assumed to be a lower semicontinuous convex function that is continuous in its domain, and $g$ is chosen to be increasing, continuous and convex in its domain, the function $\phi$ so defined is also continuous and convex. Finally, we need to check that $\phi$ is classification-calibrated. If $\Psi$ is differentiable at $u^*$ and $\Psi(\Psi(\beta)) = \beta$ for $\beta \in (\beta_1, \beta_2)$, it is simple to verify that $\Psi'(u^*) = -1$. As a result, by choosing $g$ to be differentiable at $u^*$ with $g'(u^*) > 0$, the function $\phi$ is differentiable at 0 with $\phi'(0) < 0$. Hence, $\phi$ is classification-calibrated.

One interesting consequence of Theorem 9 is that any realizable f-divergence can in fact be obtained from a fairly large set of loss functions $\phi$. More precisely, examining the construction reveals that for $\alpha \leq 0$ we are free to choose a function $g$ satisfying only mild conditions; given a choice of $g$, the loss $\phi$ is then specified for $\alpha > 0$ accordingly by equation (15).

Corollary 10. Assume that $\phi$ is a decreasing (continuous convex) loss function corresponding to an f-divergence, where $f$ is a continuous convex function that is bounded from below by an affine function. Then $\phi$ is unbounded from below if and only if $f$ is 1-coercive, i.e., $f(x)/|x| \to +\infty$ as $|x| \to \infty$.

Proof. $\phi$ is unbounded from below if and only if $\Psi(\beta) = \phi(-\phi^{-1}(\beta)) \in \mathbb{R}$ for all $\beta \in \mathbb{R}$, which is equivalent to the function $f(\beta) = \Psi^*(-\beta)$ being 1-coercive (cf. Hiriart-Urruty and Lemaréchal, 2001).

Most loss functions considered in machine learning and statistics are bounded from below (e.g., $\phi(\alpha) \geq 0$ for all $\alpha \in \mathbb{R}$). For such loss functions that are also decreasing, $f$ is not 1-coercive. However, there are interesting f-divergences, such as the symmetric KL divergence considered by Bradt and Karlin (1956), for which $f$ is 1-coercive. Examples of f-divergences and their corresponding loss functions are considered in the next section.

While the above results characterize the conditions for an f-divergence to be realized by some surrogate loss $\phi$ directly in terms of the function $f$ (and its convex conjugate, via $\Psi$), the following corollary states conditions in terms of the f-divergence per se that are sometimes simpler to check. To begin, let us call an f-divergence symmetric if $I_f(\mu, \pi) = I_f(\pi, \mu)$ for any measures $\mu$ and $\pi$.

Corollary 11. (a) If $\phi$ induces an f-divergence $I_f$ by way of Lemma 5, then the f-divergence is symmetric.
(b) If $I_f$ is symmetric and $f(u) = +\infty$ for $u < 0$, then $f$ is realizable by some decreasing surrogate loss function $\phi$.

Note that these sufficient conditions cover most surrogate loss functions found in practice; they could be loosened with further effort.

Proof. (Sketch.) (a) We have seen in Lemma 5 that $R_\phi(Q) = -I_f(\mu, \pi)$. Alternatively, we can also write
$$R_\phi(Q) = \sum_z \mu(z)\min_\alpha\Big[\phi(\alpha) + \phi(-\alpha)\frac{\pi(z)}{\mu(z)}\Big] = -\sum_z \mu(z)\, f\Big(\frac{\pi(z)}{\mu(z)}\Big) = -I_f(\pi, \mu).$$
(b) We have
$$I_f(\mu, \pi) = \sum_z \pi(z)f\big(\mu(z)/\pi(z)\big) = \sum_z \pi(z)\sup_v\big(v\mu(z)/\pi(z) - f^*(v)\big) = \sum_z \sup_v\big(v\mu(z) - f^*(v)\pi(z)\big) = \sum_z \big[v_1(z)\mu(z) - f^*(v_1(z))\pi(z)\big]$$
for some $v_1(z) \in \partial f\big(\mu(z)/\pi(z)\big)$.

Similarly, $I_f(\pi, \mu) = \sum_z \big[v_2(z)\pi(z) - f^*(v_2(z))\mu(z)\big]$ for some $v_2(z) \in \partial f\big(\pi(z)/\mu(z)\big)$. Since $I_f$ is symmetric, it follows that $v_1(z) = -f^*(v_2(z))$ and $v_2(z) = -f^*(v_1(z))$ for any $v_1(z) \in \partial f(\mu(z)/\pi(z))$ and $v_2(z) \in \partial f(\pi(z)/\mu(z))$. For simplicity of notation, replace $v_1(z)$ and $v_2(z)$ by $v_1$ and $v_2$, respectively. By varying the ratio $u = \mu(z)/\pi(z)$ we can establish a mapping between $v_1$ and $v_2$ that satisfies $v_2 = -f^*(v_1)$ and $v_1 = -f^*(v_2)$ for any $v_1 \in \partial f(u)$ and $v_2 \in \partial f(1/u)$ with $u > 0$. Recall the function $\Psi(\beta) = f^*(-\beta)$. Then, by definition, $\Psi(\Psi(-v_1)) = \Psi(f^*(v_1)) = \Psi(-v_2) = f^*(v_2) = -v_1$ for any $v_1 \in \partial f(u)$ with $u > 0$. This implies that $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \{-\partial f(u) : u > 0\}$; it follows that $\Psi(\Psi(\beta)) = \beta$ for $\beta \in (\beta_1, \beta_2)$. If $f(u) = +\infty$ for $u < 0$, then we can deduce that $\Psi$ is a decreasing function. Hence $I_f$ is realizable by some surrogate loss function by Theorem 9.

Examples of f-divergences that are not realizable by any margin-based surrogate loss (because they fail to be symmetric) include the f-divergences with $f(u) = -u^s$, where $0 < s < 1$ and $s \neq 1/2$; the chi-squared distance, which corresponds to $f(u) = (u - 1)^2$; as well as the Kullback-Leibler divergences $KL(\mu \| \pi)$ and $KL(\pi \| \mu)$, which correspond to $f(u) = u\log u$ and $f(u) = -\log u$, respectively.

4 Examples of loss functions and f-divergence measures

It is simple to check that if $f_1$ and $f_2$ are related by $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$, then $I_{f_1}(\mu, \pi) = cI_{f_2}(\mu, \pi) + aP(Y = 1) + bP(Y = -1)$, implying that maximizing $I_{f_1}$ and maximizing $I_{f_2}$ over $Q$ are equivalent. Hence, in the following, we consider such $f_1$-divergences and $f_2$-divergences as equivalent. The notion of equivalence between distance measures will be considered more formally and in more detail in the next section.

Example 1 (Hellinger distance, negative Bhattacharyya distance). The Hellinger distance is equivalent to the negative of the Bhattacharyya distance, which is an f-divergence with $f(u) = -2\sqrt{u}$. Augment the domain of $f$ with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$:
$$\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} 1/\beta & \text{when } \beta > 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = 1$. Letting $g(u) = u$, a possible loss function is given by $\phi(0) = 1$ and, for $\alpha > 0$,
$$\phi(\alpha) = \frac{1}{\alpha + 1}, \qquad \phi(-\alpha) = \alpha + 1.$$
Letting $g(u) = e^{u-1}$ instead, we obtain the exponential loss $\phi(\alpha) = \exp(-\alpha)$, agreeing with what was shown in the previous section.

Example 2 (Variational distance). In the previous section, we showed that the f-divergence arising from both the hinge loss and the 0-1 loss is based on $f(u) = -2\min(u, 1)$ for $u \geq 0$, which is equivalent to the variational distance. Augment $f$ with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$:
$$\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} 0 & \text{when } \beta > 2, \\ 2 - \beta & \text{when } 0 \leq \beta \leq 2, \\ +\infty & \text{when } \beta < 0. \end{cases}$$
Clearly $u^* = 1$. Choosing $g(u) = u$, we recover the hinge loss $\phi(\alpha) = (1 - \alpha)_+$. Choosing $g(u) = e^{u-1}$, we obtain the loss
$$\phi(\alpha) = (2 - e^{\alpha})_+ \ \text{ for } \alpha \geq 0, \qquad \phi(\alpha) = e^{-\alpha} \ \text{ for } \alpha < 0.$$
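The two losses just constructed in Example 2 look quite different, yet by construction they should induce the same function $f(u) = -2\min(u, 1)$ through equation (11). The following numerical sketch (ours) confirms this, illustrating the many-to-one nature of the correspondence depicted in Figure 1:

```python
import numpy as np

def hinge(m):
    return np.maximum(1.0 - m, 0.0)

def exp_variant(m):
    # The Example 2 loss built from g(u) = exp(u - 1):
    # phi(alpha) = (2 - e^alpha)_+ for alpha >= 0, and e^(-alpha) for alpha < 0.
    return np.where(m >= 0, np.maximum(2.0 - np.exp(m), 0.0), np.exp(-m))

alphas = np.linspace(-12.0, 12.0, 240001)

def induced_f(phi, u):
    # f(u) = -min_alpha [ phi(-alpha) + phi(alpha) u ]   (equation 11)
    return -np.min(phi(-alphas) + phi(alphas) * u)

for u in np.linspace(0.1, 4.0, 14):
    target = -2.0 * min(u, 1.0)
    assert np.isclose(induced_f(hinge, u), target, atol=1e-3)
    assert np.isclose(induced_f(exp_variant, u), target, atol=1e-3)
print("two different losses, one induced f-divergence: f(u) = -2 min(u, 1)")
```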

Example 3 (Capacitory discrimination distance). The capacitory discrimination distance is equivalent to an f-divergence with $f(u) = u\log u - (1 + u)\log(u + 1)$, defined for $u \geq 0$. Augment this function with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$, we have
$$\Psi(\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} \beta - \log(e^{\beta} - 1) & \text{for } \beta > 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = \log 2$. Choosing $g(u) = \log\big(1 + \frac{e^u}{2}\big)$ gives the logistic loss $\phi(\alpha) = \log(1 + e^{-\alpha})$.

Example 4 (Triangular discrimination distance). The triangular discrimination distance is equivalent to the negative of the harmonic distance, i.e., the f-divergence with $f(u) = -\frac{4u}{u+1}$ for $u \geq 0$. Augment $f$ with $f(u) = +\infty$ for $u < 0$. Then
$$\Psi(\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} (2 - \sqrt{\beta})^2 & \text{for } \beta \geq 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = 1$. Choosing $g(u) = u^2$ gives the least squares loss $\phi(\alpha) = (1 - \alpha)^2$.

Example 5 (Another Kullback-Leibler based distance). We have shown previously that, up to a constant, the logistic loss corresponds to the capacitory discrimination distance $C(\mu, \pi) = KL\big(\mu \,\|\, \frac{\mu+\pi}{2}\big) + KL\big(\pi \,\|\, \frac{\mu+\pi}{2}\big)$. While the KL divergences themselves (both $KL(\mu \| \pi)$ and $KL(\pi \| \mu)$) are not realizable by any margin-based loss function, due to their asymmetry, let us consider the symmetric Kullback-Leibler distance
$$KL_s(\mu, \pi) = KL(\mu \| \pi) + KL(\pi \| \mu),$$
which is an f-divergence with $f(u) = -\log u + u\log u$ for $u \geq 0$, and $+\infty$ otherwise. Observe that
$$\Psi(\beta) = \sup_{u \geq 0}\big(-\beta u + \log u - u\log u\big).$$
Taking the derivative with respect to $u$ of the expression inside the supremum and setting it to zero gives $-\beta + 1/u - \log u - 1 = 0$. Define the function $r : [0, +\infty) \to [-\infty, +\infty]$ by $r(u) = 1/u - \log u$; it is easy to see that $r$ is a strictly decreasing function whose range covers the whole real line. Hence,
$$\Psi(\beta) = u + \log u - 1, \ \text{ where } \beta + 1 = r(u); \quad \text{equivalently,} \quad \Psi(\beta) = r(1/u) - 1 = r\Big(\frac{1}{r^{-1}(\beta + 1)}\Big) - 1.$$
It is simple to check that $\Psi$ is a strictly decreasing convex function with $\Psi(0) = 0$ and $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \mathbb{R}$. By Lemma 8 and Theorem 9, we can construct all corresponding convex loss functions, which turn out to be strictly decreasing and of the form (15):
$$\phi(\alpha) = \begin{cases} g(-\alpha) & \text{for } \alpha \leq 0, \\ \Psi(g(\alpha)) & \text{otherwise,} \end{cases}$$
where $g : [0, +\infty) \to [0, +\infty)$ is any increasing convex function satisfying $g(0) = 0$. For a choice of $g$ that results in a closed form for $\phi$, let $g(u) = e^u + u - 1$. Then we obtain a valid loss function:
$$\phi(\alpha) = e^{-\alpha} - \alpha - 1.$$

5 On comparison of surrogate loss functions and quantization schemes

The previous section was devoted to the study of the correspondence between f-divergences and the optimal $\phi$-risk $R_\phi(Q)$ for a fixed experiment $Q$. Our ultimate goal, however, is that of choosing an optimal $Q$, which is a problem of experimental design (Blackwell, 1953). In the remainder of the paper, we address the experiment design problem via joint optimization of the $\phi$-risk (or more precisely, its empirical version) over both the decision rule $\gamma$ and the choice of experiment $Q$ (hereafter referred to as a quantizer). This procedure raises a natural theoretical question: for which loss functions $\phi$ does such joint optimization lead to minimum Bayes risk? Note that the minimum here is taken over both the decision rule $\gamma$ and the space of experiments $\mathcal{Q}$, so that this question is not covered by standard consistency results (Zhang, 2004, Steinwart, 2005, Bartlett et al., 2005). To this end, we consider the comparison of loss functions and the comparison of quantization schemes. In Section 6, we describe how the results developed herein can be leveraged to resolve the issue of consistency when learning an optimal quantizer design from empirical data.

5.1 Inequalities relating surrogate loss functions and f-divergences

The correspondence between surrogate loss functions and f-divergences allows one to compare surrogate $\phi$-risks by comparing the corresponding f-divergences, and vice versa. For instance, since the optimal $\phi$-risk for the hinge loss is equivalent to the optimal $\phi$-risk for the 0-1 loss, we can say affirmatively that minimizing the risk for the hinge loss is equivalent to minimizing the Bayes risk. One particularly well-studied connection between f-divergences in the literature is the set of inequalities among divergence measures, some of which are stated in the following lemma.

Lemma 12. (a) $V^2 \leq \Delta \leq V$.
(b) $2h^2 \leq \Delta \leq 4h^2$. As a result, $\frac{1}{2}V^2 \leq 2h^2 \leq V$.
(c) $\frac{1}{2}\Delta \leq C \leq \log 2 \cdot \Delta$. As a result, $\frac{1}{2}V^2 \leq C \leq \log 2 \cdot V$.

Proof. (a) That $\Delta \leq V$ is trivial. The first inequality can be derived by an application of the Cauchy-Schwarz inequality:
$$V^2(\mu, \pi) = \bigg(\sum_z \frac{|\mu(z) - \pi(z)|}{\sqrt{\mu(z) + \pi(z)}}\cdot\sqrt{\mu(z) + \pi(z)}\bigg)^2 \leq \sum_z \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)}\cdot\sum_z\big(\mu(z) + \pi(z)\big) = \Delta(\mu, \pi).$$
(b) Note that for any $z \in \mathcal{Z}$, we have
$$1 \leq \frac{\big(\sqrt{\mu(z)} + \sqrt{\pi(z)}\big)^2}{\mu(z) + \pi(z)} \leq 2.$$
Applying these bounds in the expression
$$\Delta(\mu, \pi) = \sum_{z \in \mathcal{Z}} \frac{\big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2\big(\sqrt{\mu(z)} + \sqrt{\pi(z)}\big)^2}{\mu(z) + \pi(z)}$$
yields $2h^2 \leq \Delta \leq 4h^2$.
(c) See Topsoe (2000) for a proof.

It is straightforward to derive the following connections between the different risks.

Lemma 13. (a) $R_{\mathrm{hinge}}(Q) = 2R_{\mathrm{bayes}}(Q)$.
(b) $2R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{sqr}}(Q) \leq 1 - \big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.
(c) $2\log 2 \cdot R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{log}}(Q) \leq \log 2 - \frac{1}{2}\big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.
(d) $2R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{exp}}(Q) \leq 1 - \frac{1}{2}\big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.

According to the above lemmas, all of the distance measures considered are bounded above and below by constant multiples of the variational distance. To obtain the optimal quantization scheme $Q$, note that we want to maximize (rather than minimize) the corresponding f-divergence. Except for the hinge loss, these lemmas do not tell us whether minimizing the $\phi$-risk leads to a classifier $\gamma$ and a quantization rule $Q$ with minimal Bayes risk. In the sequel we discuss this issue in more detail: specifically, we wish to find all losses $\phi$ such that minimizing the $\phi$-risk leads to the same optimal decision rule $(Q, \gamma)$ as minimizing the Bayes risk.
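Both lemmas are straightforward to stress-test numerically. The sketch below (ours) draws random strictly positive measures with $\sum_z [\mu(z) + \pi(z)] = 1$, computes the risks via the identities of Section 2.2, and checks every stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
kl = lambda u, v: np.sum(u * np.log(u / v))
eps = 1e-9

for _ in range(1000):
    w = rng.random(6) + 1e-3
    w /= w.sum()
    mu, pi = w[:3], w[3:]

    V = np.abs(mu - pi).sum()
    Delta = np.sum((mu - pi) ** 2 / (mu + pi))
    h2 = 0.5 * np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)
    C = kl(mu, (mu + pi) / 2) + kl(pi, (mu + pi) / 2)

    assert V**2 - eps <= Delta <= V + eps                     # Lemma 12(a)
    assert 2 * h2 - eps <= Delta <= 4 * h2 + eps              # Lemma 12(b)
    assert 0.5 * Delta - eps <= C <= np.log(2) * Delta + eps  # Lemma 12(c)

    # Risks via the Section 2.2 identities.
    r_bayes = np.minimum(mu, pi).sum()
    r_sqr, r_log, r_exp = 1 - Delta, np.log(2) - C, 1 - 2 * h2
    b = 1 - 2 * r_bayes
    assert 2 * r_bayes - eps <= r_sqr <= 1 - b**2 + eps                          # 13(b)
    assert 2 * np.log(2) * r_bayes - eps <= r_log <= np.log(2) - b**2 / 2 + eps  # 13(c)
    assert 2 * r_bayes - eps <= r_exp <= 1 - b**2 / 2 + eps                      # 13(d)
print("Lemma 12 and Lemma 13 bounds hold on 1000 random (mu, pi) pairs")
```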

5.2 Connection between 0-1 loss and f-divergences

The connection between f-divergences and the 0-1 loss can be traced back to seminal work on the comparison of experiments, pioneered by Blackwell and others (Blackwell, 1951, 1953, Bradt and Karlin, 1956).

Definition 14. $Q_1$ is better than $Q_2$ if $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for any prior probability $q = P(Y = -1) \in (0, 1)$.

Recall that a choice of quantization scheme $Q$ induces two conditional distributions $P(Z \mid Y = 1) \sim P_1$ and $P(Z \mid Y = -1) \sim P_{-1}$. Hence, we shall write $P_1^Q$ and $P_{-1}^Q$ to denote the fact that both $P_1$ and $P_{-1}$ are determined by the specific choice of $Q$. By parameterizing the decision-theoretic criterion in terms of the loss function $\phi$ and establishing a precise correspondence between $\phi$ and the f-divergence, it is simple to derive the following theorem, which relates the 0-1 loss and f-divergences in a powerful way.

Theorem 15 (Blackwell, 1951, 1953). For two quantization schemes $Q_1$ and $Q_2$, the following statements are equivalent:
1. $Q_1$ is better than $Q_2$ (i.e., $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for any prior probability $q \in (0, 1)$);
2. $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \geq I_f(P_1^{Q_2}, P_{-1}^{Q_2})$ for all functions of the form $f(u) = -\min(u, c)$ with $c > 0$;
3. $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \geq I_f(P_1^{Q_2}, P_{-1}^{Q_2})$ for all convex functions $f$.

Proof. $1 \Rightarrow 2$: By the correspondence between the 0-1 loss and the f-divergence with $f(u) = -\min(u, 1)$, and the remark following Lemma 5, we have $R_{\mathrm{bayes}}(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$, where $f_q(u) := qf\big(\frac{1-q}{q}u\big) = -(1-q)\min\big(u, \frac{q}{1-q}\big)$. Hence 2 follows by letting $q$ range over $(0, 1)$.
$2 \Rightarrow 1$: Trivial.
$2 \Rightarrow 3$: Any convex function $f(u)$ can be uniformly approximated as the sum of a linear function and a combination $-\sum_k \alpha_k \min(u, c_k)$, where $\alpha_k > 0$ and $c_k > 0$ for all $k$. For a linear function $f$, the divergence $I_f(P_1, P_{-1})$ does not depend on $(P_1, P_{-1})$. Hence, 3 follows easily from 2.

Corollary 16. $Q_1$ is better than $Q_2$ if and only if $R_\phi(Q_1) \leq R_\phi(Q_2)$ for any loss function $\phi$.

Proof. By Lemma 5, $R_\phi(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$. The corollary is then immediate from the above theorem.

One implication of Corollary 16 is that if $R_\phi(Q_1) \leq R_\phi(Q_2)$ for some loss function $\phi$, then $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for some prior probability on the hypotheses $P(Y)$. This justifies using a surrogate loss function $\phi$ in place of the 0-1 loss for certain prior probabilities. However, it is in general very difficult to know which prior probability corresponds to a given surrogate loss $\phi$. In many applications, the prior probabilities on the hypotheses are fixed, and the optimum quantization scheme $Q$ is sought for a fixed region of priors; the notion of "better" is then limited in its usefulness. We are thus motivated to consider, in the following subsection, a different method of determining which loss functions (or equivalently, which f-divergences) lead to the same optimal experimental design as the 0-1 loss (respectively, the variational distance).

5.3 Comparison of surrogate loss functions (f-divergence measures)

In the following definition, $\phi_1$ and $\phi_2$ correspond to f-divergences with convex functions $f_1$ and $f_2$, respectively.

Definition 17. 1. Given a probability distribution $P(X, Y)$, $\phi_1$ is P-equivalent to $\phi_2$ with respect to $P(X, Y)$, denoted $\phi_1 \stackrel{P}{\approx} \phi_2$, if for any quantization rules $Q_1, Q_2$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{P}{\approx} f_2$.
2. Given two hypotheses $P(X \mid Y = 1)$ and $P(X \mid Y = -1)$, $\phi_1$ is H-equivalent to $\phi_2$ with respect to $P(X \mid Y = \pm 1)$, denoted $\phi_1 \stackrel{H}{\approx} \phi_2$, if for any quantization rules $Q_1, Q_2$ and any prior probability $q = P(Y = -1)$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{H}{\approx} f_2$.
3. $\phi_1$ and $\phi_2$ are universally equivalent, denoted $\phi_1 \stackrel{u}{\approx} \phi_2$, if for any $P(X, Y)$ and any quantization rules $Q_1, Q_2$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{u}{\approx} f_2$.

Remarks. 1. Clearly, universal equivalence implies H-equivalence, which in turn implies P-equivalence. The notions of P-equivalence and H-equivalence are most useful for a particular problem at hand, when knowledge of the hypotheses $P(X \mid Y = \pm 1)$ and/or the prior probability $q = P(Y = -1)$ is available. Nonetheless, in many cases it seems difficult to determine these types of equivalence even when the underlying $P(X, Y)$ is known. The notion of universal equivalence is most useful when $P(X, Y)$ is not known and is accessible only through empirical data (as in Nguyen et al., 2005).
2. The notions of equivalence defined above ensure that minimizing the $\phi_1$-risk $R_{\phi_1}(Q)$ yields an optimal $Q$ that is also optimal for minimizing $R_{\phi_2}(Q)$. It is worth noting that this behavior holds even for algorithms that optimize the $\phi$-risk over $(Q, \gamma)$ using local methods.

In the following, we characterize necessary and sufficient conditions for universal equivalence. The following fact is immediate.

Lemma 18. If $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$, then $f_1 \stackrel{u}{\approx} f_2$.

Proof. Note that $I_{f_1}(\mu, \pi) = cI_{f_2}(\mu, \pi) + a(1 - q) + bq$, implying $f_1 \stackrel{u}{\approx} f_2$.

The following necessary condition is very useful.

Lemma 19. Given a continuous convex function $f : \mathbb{R}_+ \to \mathbb{R}$, define, for any $u, v \in \mathbb{R}_+$ such that $\partial f(u) \cap \partial f(v) = \emptyset$, the set
$$T_f(u, v) := \bigg\{\frac{u\alpha - v\beta - f(u) + f(v)}{\alpha - \beta} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta} \ :\ \alpha \in \partial f(u),\ \beta \in \partial f(v),\ \alpha \neq \beta\bigg\}.$$
If $f_1 \stackrel{u}{\approx} f_2$, then for any $u, v \in \mathbb{R}_+$, one of the following must be true:
1. $T_f(u, v)$ is well-defined for both $f_1$ and $f_2$, and $T_{f_1}(u, v) \cap T_{f_2}(u, v) \neq \emptyset$;
2. both $f_1$ and $f_2$ are linear on $[u, v]$.

Note that if $f$ is differentiable at $u$ and $v$ with $f'(u) \neq f'(v)$, then $T_f(u, v)$ reduces to the single number
$$\frac{uf'(u) - vf'(v) - f(u) + f(v)}{f'(u) - f'(v)} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta}, \quad \text{where } \alpha = f'(u),\ \beta = f'(v),$$
and $f^*$ denotes the conjugate dual of $f$.

Proof. Consider a distribution $P(X, Y)$ with $P(Y = -1) = q = 1 - P(Y = 1)$, $P(X \mid Y = 1) \sim \mathrm{Uniform}[a, c]$ and $P(X \mid Y = -1) \sim \mathrm{Uniform}[0, b]$, where $0 < a < b < c$. Let $Z \in \{1, 2\}$ be the quantized version of $X$, and consider the family of deterministic quantization schemes $Q(t)$, parameterized by $t \in (a, b)$, such that $Q(z = 1 \mid x) = 1$ when $x \leq t$, and $Q(z = 2 \mid x) = 1$ when $x > t$. Then we have
$$\mu(1) = (1-q)\frac{t - a}{c - a}, \quad \mu(2) = (1-q)\frac{c - t}{c - a}, \quad \pi(1) = q\,\frac{t}{b}, \quad \pi(2) = q\,\frac{b - t}{b}.$$
Therefore, the f-divergence between $\mu$ and $\pi$ for the quantization scheme $Q(t)$ has the form
$$I_f(\mu, \pi) = \frac{qt}{b}\, f\Big(\frac{(t - a)b(1 - q)}{(c - a)tq}\Big) + \frac{q(b - t)}{b}\, f\Big(\frac{(c - t)b(1 - q)}{(c - a)(b - t)q}\Big). \tag{18}$$

If $f_1 \stackrel{u}{\approx} f_2$, then $I_{f_1}(\mu, \pi)$ and $I_{f_2}(\mu, \pi)$ must have the same monotonicity property everywhere, for any parameters $q$ and $a < b < c$. Let $\gamma = \frac{b(1-q)}{(c-a)q}$, which can be made an arbitrary positive number. Define
$$F(f, t) = t\, f\Big(\frac{(t-a)\gamma}{t}\Big) + (b - t)\, f\Big(\frac{(c-t)\gamma}{b-t}\Big).$$
Then $F(f_1, t)$ and $F(f_2, t)$ must have the same monotonicity property, for any parameters $\gamma$ and $a < b < c$. Due to the convexity of the f-divergence with respect to $(\mu, \pi)$, we can deduce that $F(f, t)$ is a convex function of $t \in (a, b)$. Hence,
$$0 \in \partial F(f_1, t) \iff 0 \in \partial F(f_2, t). \tag{19}$$
From standard subdifferential calculus (e.g., Hiriart-Urruty and Lemaréchal, 2001), we have
$$\partial F(f, t) = f\Big(\frac{(t-a)\gamma}{t}\Big) + \frac{a\gamma}{t}\,\partial f\Big(\frac{(t-a)\gamma}{t}\Big) - f\Big(\frac{(c-t)\gamma}{b-t}\Big) + \frac{(c-b)\gamma}{b-t}\,\partial f\Big(\frac{(c-t)\gamma}{b-t}\Big).$$
Let $u = \frac{(t-a)\gamma}{t}$ and $v = \frac{(c-t)\gamma}{b-t}$. Note that we can choose arbitrary $u$, $v$, $\gamma$, and then choose $a, b, c$ accordingly. Then:
$$0 \in \partial F(f, t) \iff 0 \in (\gamma - u)\,\partial f(u) + f(u) - f(v) + (v - \gamma)\,\partial f(v) \tag{20}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ 0 = (\gamma - u)\alpha + f(u) - f(v) + (v - \gamma)\beta \tag{21}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = u\alpha - f(u) + f(v) - v\beta \tag{22}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = f^*(\alpha) - f^*(\beta). \tag{23}$$
Now, (19) holds for any $t$, implying that for any $u, v, \gamma$, condition (23) holds for $f_1$ if and only if it also holds for $f_2$. If $\partial f_2(u) \cap \partial f_2(v) = \emptyset$, and $f^*(\alpha), f^*(\beta) < +\infty$ for both $f_1$ and $f_2$, then there exist $\alpha_1 \in \partial f_1(u)$, $\beta_1 \in \partial f_1(v)$, $\alpha_2 \in \partial f_2(u)$, $\beta_2 \in \partial f_2(v)$ such that
$$\frac{f_1^*(\alpha_1) - f_1^*(\beta_1)}{\alpha_1 - \beta_1} = \frac{f_2^*(\alpha_2) - f_2^*(\beta_2)}{\alpha_2 - \beta_2}.$$
If instead $\partial f_1(u) \cap \partial f_1(v) \neq \emptyset$, then, due to the monotonicity of subdifferentials, the function $f_1$ is linear on $[u, v]$ with some slope $s$. In that case (23) holds for $f_1$ and any $\gamma$ by choosing $\alpha = \beta = s$; this implies that (23) also holds for $f_2$ for any $\gamma$, and thus we deduce that $f_2$ is also linear on $[u, v]$.

Theorem 20. Suppose that $f_1$ and $f_2$ are convex functions on $[0, +\infty) \to \mathbb{R}$, differentiable almost everywhere, with $f_1 \stackrel{u}{\approx} f_2$. Then $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$.

Proof. Let $v$ be a point at which both $f_1$ and $f_2$ are differentiable, and let $d_1 = f_1'(v)$, $d_2 = f_2'(v)$. Without loss of generality, assume that $f_1(v) = f_2(v) = 0$ (if not, we may consider instead the functions, of $u$, $f_1(u) - f_1(v)$ and $f_2(u) - f_2(v)$). Now, for any $u$ at which both $f_1$ and $f_2$ are differentiable, apply the above lemma to $v$ and $u$: either both $f_1$ and $f_2$ are linear on $[v, u]$ (or $[u, v]$ if $u < v$), in which case $f_1(u) = cf_2(u)$ for some constant $c$, or the following is true:
$$\frac{uf_1'(u) - f_1(u) - vd_1}{f_1'(u) - d_1} = \frac{uf_2'(u) - f_2(u) - vd_2}{f_2'(u) - d_2}.$$

In either case, we have
$$\big(uf_1'(u) - f_1(u) - vd_1\big)\big(f_2'(u) - d_2\big) = \big(uf_2'(u) - f_2(u) - vd_2\big)\big(f_1'(u) - d_1\big).$$
Let $f_1(u) = g_1(u) + d_1u$ and $f_2(u) = g_2(u) + d_2u$. Then
$$\big(ug_1'(u) - g_1(u) - vd_1\big)g_2'(u) = \big(ug_2'(u) - g_2(u) - vd_2\big)g_1'(u),$$
implying that $\big(g_1(u) + vd_1\big)g_2'(u) = \big(g_2(u) + vd_2\big)g_1'(u)$ for any $u$ at which $f_1$ and $f_2$ are both differentiable. It follows that $g_1(u) + vd_1 = c\big(g_2(u) + vd_2\big)$ for some constant $c$, and this constant must be the same for all $u$ due to the continuity of $f_1$ and $f_2$. Hence,
$$f_1(u) = g_1(u) + d_1u = cg_2(u) + d_1u + cvd_2 - vd_1 = cf_2(u) + (d_1 - cd_2)u + cvd_2 - vd_1.$$
It is now simple to check that, in order for $I_{f_1}$ and $I_{f_2}$ to have the same monotonicity, it is necessary and sufficient that $c > 0$.

Corollary 21. 1. All f-divergences (for continuous convex $f : [0, +\infty) \to \mathbb{R}$) that are universally equivalent to the variational distance must have the form
$$f(u) = -c\min(u, 1) + au + b, \quad \text{for } c > 0.$$
2. The 0-1 loss is universally equivalent to all and only those loss functions whose corresponding f-divergences are based on $f(u) = -c\min(u, 1) + au + b$ for $c > 0$.

Proof. Statement 2 follows immediately from statement 1. Note that the proof of Theorem 20 does not apply directly here, because it requires both $f_1$ and $f_2$ to be differentiable almost everywhere. Nonetheless, it is simple to show that statement 1 follows directly from Lemma 19. Indeed, the variational distance corresponds to $f_1(u) = |u - 1| = u + 1 - 2\min\{u, 1\}$, which is linear before 1 and linear after 1; the same must therefore be true of any continuous convex function $f_2$ universally equivalent to it. All such functions can indeed be written as $f(u) = -c\min(u, 1) + au + b$ for some constants $c, a, b$, and to have the same monotonicity as $f_1$ it is necessary and sufficient that $c > 0$.

The notion of universal equivalence is quite restrictive within the domain of f-divergences, as it requires two universally equivalent functions $f$ to be related by an additive linear term and a positive multiplicative constant. However, this restrictiveness does not carry over to the class of surrogate loss functions equivalent to the 0-1 loss: as we showed in the previous section, there exists a fairly large class of such surrogate loss functions.

5.4 Design of convex loss functions equivalent to 0-1 loss

In this section, we study in more detail the class of surrogate loss functions $\phi$ that are universally equivalent to the 0-1 loss. As in the classification literature in machine learning, the notion of a surrogate loss function is useful when we have access to $P(X, Y)$ only through empirical data. In this situation, the decision rule $(Q, \gamma)$ is optimized by minimizing an empirical version of the $\phi$-risk, $\hat{\mathbb{E}}\phi(Y\gamma(Z))$. In this setting, we do not have closed-form knowledge of $\mu(z)$ and $\pi(z)$, on which the optimal $\gamma(z)$ is based; in other words, we do not have a closed-form solution for $\gamma(z)$.

What are desirable properties of a surrogate loss function? There are computational properties that we would like, such as convexity and differentiability, as well as statistical properties (such as consistency). By restricting our attention to surrogate loss functions that are universally equivalent to the 0-1 loss, we shall be able to show (in the next section) that an algorithm that jointly minimizes an empirical version of the $\phi$-risk over $(Q, \gamma)$ is universally consistent, i.e., it achieves the minimum Bayes risk in the limit of infinite data. In this subsection, we shall prove that there does not exist a differentiable surrogate loss that is universally equivalent to the 0-1 loss.
Before proceeding to the proof, let us present several examples of


More information

Estimating divergence functionals and the likelihood ratio by convex risk minimization

Estimating divergence functionals and the likelihood ratio by convex risk minimization Estimating divergence functionals and the likelihood ratio by convex risk minimization XuanLong Nguyen Dept. of Statistical Science Duke University xuanlong.nguyen@stat.duke.edu Martin J. Wainwright Dept.

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Classification with Reject Option

Classification with Reject Option Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012 Outline. Introduction.. Classification with reject option. Spirit of the papers BW2008.. Infinite

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

Calibrated Surrogate Losses

Calibrated Surrogate Losses EECS 598: Statistical Learning Theory, Winter 2014 Topic 14 Calibrated Surrogate Losses Lecturer: Clayton Scott Scribe: Efrén Cruz Cortés Disclaimer: These notes have not been subjected to the usual scrutiny

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Learning in decentralized systems: A nonparametric approach. XuanLong Nguyen. Doctor of Philosophy

Learning in decentralized systems: A nonparametric approach. XuanLong Nguyen. Doctor of Philosophy Learning in decentralized systems: A nonparametric approach by XuanLong Nguyen B.S. (Pohang University of Science and Technology) 1999 M.S. (Arizona State University) 2001 M.A. (University of California,

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM

DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM Due: Monday, April 11, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to

More information

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework

More information

Decentralized Detection in Sensor Networks

Decentralized Detection in Sensor Networks IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 51, NO 2, FEBRUARY 2003 407 Decentralized Detection in Sensor Networks Jean-François Chamberland, Student Member, IEEE, and Venugopal V Veeravalli, Senior Member,

More information

DIVERGENCES (or pseudodistances) based on likelihood

DIVERGENCES (or pseudodistances) based on likelihood IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 56, NO 11, NOVEMBER 2010 5847 Estimating Divergence Functionals the Likelihood Ratio by Convex Risk Minimization XuanLong Nguyen, Martin J Wainwright, Michael

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

BAYESIAN DESIGN OF DECENTRALIZED HYPOTHESIS TESTING UNDER COMMUNICATION CONSTRAINTS. Alla Tarighati, and Joakim Jaldén

BAYESIAN DESIGN OF DECENTRALIZED HYPOTHESIS TESTING UNDER COMMUNICATION CONSTRAINTS. Alla Tarighati, and Joakim Jaldén 204 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) BAYESIA DESIG OF DECETRALIZED HYPOTHESIS TESTIG UDER COMMUICATIO COSTRAITS Alla Tarighati, and Joakim Jaldén ACCESS

More information

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection 2708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 11, NOVEMBER 2004 Exact Minimax Strategies for Predictive Density Estimation, Data Compression, Model Selection Feng Liang Andrew Barron, Senior

More information

Convexity, Detection, and Generalized f-divergences

Convexity, Detection, and Generalized f-divergences Convexity, Detection, and Generalized f-divergences Khashayar Khosravi Feng Ruan John Duchi June 5, 015 1 Introduction The goal of classification problem is to learn a discriminant function for classification

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

arxiv: v4 [cs.it] 17 Oct 2015

arxiv: v4 [cs.it] 17 Oct 2015 Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets Igal Sason Department of Electrical Engineering Technion Israel Institute of Technology

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago September 2015 Abstract Rothschild and Stiglitz (1970) introduce a

More information

Learning in decentralized systems: A nonparametric approach

Learning in decentralized systems: A nonparametric approach Learning in decentralized systems: A nonparametric approach Xuanlong Nguyen Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2007-111 http://www.eecs.berkeley.edu/pubs/techrpts/2007/eecs-2007-111.html

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

On Bayes Risk Lower Bounds

On Bayes Risk Lower Bounds Journal of Machine Learning Research 17 (2016) 1-58 Submitted 4/16; Revised 10/16; Published 12/16 On Bayes Risk Lower Bounds Xi Chen Stern School of Business New York University New York, NY 10012, USA

More information

Smart Predict, then Optimize

Smart Predict, then Optimize . Smart Predict, then Optimize arxiv:1710.08005v2 [math.oc] 14 Dec 2017 Adam N. Elmachtoub Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, adam@ieor.columbia.edu

More information

Stanford Statistics 311/Electrical Engineering 377

Stanford Statistics 311/Electrical Engineering 377 I. Bayes risk in classification problems a. Recall definition (1.2.3) of f-divergence between two distributions P and Q as ( ) p(x) D f (P Q) : q(x)f dx, q(x) where f : R + R is a convex function satisfying

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Subdifferential representation of convex functions: refinements and applications

Subdifferential representation of convex functions: refinements and applications Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential

More information

Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

On the Consistency of AUC Pairwise Optimization

On the Consistency of AUC Pairwise Optimization On the Consistency of AUC Pairwise Optimization Wei Gao and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University Collaborative Innovation Center of Novel Software Technology

More information

On deterministic reformulations of distributionally robust joint chance constrained optimization problems

On deterministic reformulations of distributionally robust joint chance constrained optimization problems On deterministic reformulations of distributionally robust joint chance constrained optimization problems Weijun Xie and Shabbir Ahmed School of Industrial & Systems Engineering Georgia Institute of Technology,

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Decentralized Detection In Wireless Sensor Networks

Decentralized Detection In Wireless Sensor Networks Decentralized Detection In Wireless Sensor Networks Milad Kharratzadeh Department of Electrical & Computer Engineering McGill University Montreal, Canada April 2011 Statistical Detection and Estimation

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Approximation Metrics for Discrete and Continuous Systems

Approximation Metrics for Discrete and Continuous Systems University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science May 2007 Approximation Metrics for Discrete Continuous Systems Antoine Girard University

More information

Global minimization. Chapter Upper and lower bounds

Global minimization. Chapter Upper and lower bounds Chapter 1 Global minimization The issues related to the behavior of global minimization problems along a sequence of functionals F are by now well understood, and mainly rely on the concept of -limit.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011

7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Asymptotic Optimality Theory for Decentralized Sequential Multihypothesis Testing Problems Yan Wang Yajun Mei Abstract The Bayesian

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u.

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u. 5. Fields 5.1. Field extensions. Let F E be a subfield of the field E. We also describe this situation by saying that E is an extension field of F, and we write E/F to express this fact. If E/F is a field

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Convexity, Classification, and Risk Bounds

Convexity, Classification, and Risk Bounds Convexity, Classification, and Risk Bounds Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@stat.berkeley.edu Michael I. Jordan Computer

More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Estimation of signal information content for classification

Estimation of signal information content for classification Estimation of signal information content for classification The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Math Tune-Up Louisiana State University August, Lectures on Partial Differential Equations and Hilbert Space

Math Tune-Up Louisiana State University August, Lectures on Partial Differential Equations and Hilbert Space Math Tune-Up Louisiana State University August, 2008 Lectures on Partial Differential Equations and Hilbert Space 1. A linear partial differential equation of physics We begin by considering the simplest

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Notes on Ordered Sets

Notes on Ordered Sets Notes on Ordered Sets Mariusz Wodzicki September 10, 2013 1 Vocabulary 1.1 Definitions Definition 1.1 A binary relation on a set S is said to be a partial order if it is reflexive, x x, weakly antisymmetric,

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information