On distance measures, surrogate loss functions, and distributed detection


XuanLong Nguyen
Computer Science Division
University of California, Berkeley

Martin J. Wainwright
EECS Department and Statistics Department
University of California, Berkeley

Michael I. Jordan
EECS Department and Statistics Department
University of California, Berkeley

September 12, 2005
Technical Report, Department of Statistics, University of California, Berkeley

Abstract

In this paper, we show the correspondence between distance measures and surrogate loss functions in the context of decentralized binary hypothesis testing. This correspondence helps explicate the use of various distance measures in signal processing and quantization theory, as well as explain the behavior of surrogate loss functions often used in machine learning and statistics. We then develop a notion of equivalence among distance measures, and among loss functions. Finally, we investigate the statistical behavior of a nonparametric decentralized hypothesis testing algorithm that minimizes convex surrogate loss functions equivalent to the 0-1 loss.

1 Introduction

Discriminant analysis has undergone significant and sustained development over several decades in various engineering fields, where elaborations of the basic paradigm have been developed that are responsive to particular constellations of physical, informational and computational constraints. For example, research in the area of distributed detection focuses on problems in which measurements are obtained by physically-distributed devices which, due to power and bandwidth limitations, send quantized versions of their measurements to a central site where detection decisions are made (Tsitsiklis, 1993b, Blum et al., 1997). This problem is of significant current interest in the field of sensor networks (e.g., Chamberland and Veeravalli, 2003). Similar problems known as signal selection problems also blend discriminant analysis with aspects of experimental design (Kailath, 1967).

Such problems are generally formulated as hypothesis-testing problems, either within a Neyman-Pearson or Bayesian framework. Unfortunately, these formulations rarely lead to computationally tractable algorithms, and much of the focus has been on defining surrogates for the probability of error that lead to practical algorithms. For example, the Hellinger distance has been championed for distributed detection problems, due to the fact that it yields a tractable algorithm both for the experimental design aspect of the problem (the choice of quantization rules) and for the discriminant analysis aspect of the problem (Longo et al., 1990). More broadly, a class of functions known as Ali-Silvey distances or f-divergences, which includes the Hellinger distance as well as the variational distance, the Kullback-Leibler (KL) divergence and the Chernoff distance, has been explored as a source of tractable approximations to the probability of error in a wide variety of applied discrimination problems (Ali and Silvey, 1966, Csiszár, 1967).

Theoretical support for the use of f-divergences in discrimination problems comes from two main sources. First, a classical result of Blackwell (1951) establishes that if procedure A has a smaller f-divergence than procedure B (for some particular f-divergence), then there must exist some set of prior probabilities such that procedure A has a smaller probability of error than procedure B. This is a weak justification, but it has proved useful in designing signal selection and quantization rules (Kailath, 1967, Poor and Thomas, 1977, Longo et al., 1990). Second, f-divergences often arise as exponents in asymptotic characterizations of the optimal rate of convergence in hypothesis-testing problems; examples include the KL divergence (for the Neyman-Pearson problem) and the Chernoff distance (for the Bayesian formulation).

A parallel, more recent, line of research in the field of statistical machine learning has also focused on computationally-motivated surrogate functions in discriminant analysis. In statistical machine learning, the formulation of discrimination problems (also known as classification problems) is decision-theoretic, with the Bayes error interpreted as risk under a 0-1 loss, with the algorithmic goal being that of minimizing the empirical expectation of the 0-1 loss, and with empirical process theory providing the underlying framework for theoretical analysis. In this setting, the nonconvexity of the 0-1 loss is viewed as the source of the intractability of minimizing the probability of error, and researchers have studied algorithms based on replacing the 0-1 loss with surrogate loss functions that are convex upper bounds on the 0-1 loss (see Figure 1). A wide variety of practically successful machine learning algorithms are based on this tactic, including the support vector machine (Schölkopf and Smola, 2002), AdaBoost (Freund and Schapire, 1997), the X4 method (Breiman, 1998) and logistic regression (Friedman et al., 2000). Theoretical support for this line of research comes from the results of Bartlett et al. (2005), Zhang (2004), and others, who have provided characterizations of the class of surrogate loss functions in terms of the consistency of the resulting estimation procedures, and have shown how the rate of convergence to the Bayes optimal risk depends on properties of the surrogate loss functions.
The f-divergences studied in information theory and the surrogate loss functions studied in statistical machine learning are different mathematical objects: the former are functions on pairs of measures, while the latter are functions on values of discriminant functions and class labels. Nonetheless, their shared role in obtaining computationally tractable algorithms for discriminant analysis suggests that they should be related. Indeed, Blackwell's result hints at such a relationship, but its focus on the 0-1 loss does not lend itself to developing relationships between specific f-divergences and specific surrogate loss functions. In the current paper we analyze the relationship between f-divergences and surrogate loss functions in detail, presenting a full characterization of the connection. We show that for any expected surrogate loss, regardless of its convexity, there exists a corresponding convex $f$ such that minimizing the expected loss is equivalent to maximizing the f-divergence. We also provide necessary and sufficient conditions for an f-divergence to be realized by some (decreasing) convex loss function. More precisely, given a convex $f$, we provide a constructive procedure to generate all decreasing convex loss functions for which the correspondence holds.

The relationship is suggested in Figure 1; note in particular that there are in general many loss functions that correspond to the same f-divergence.

Figure 1. Illustration of the correspondence between f-divergence measures and loss functions. For each loss function $\phi$, there exists exactly one corresponding f-divergence (for some convex $f$) such that the $\phi$-risk is equal to the negative f-divergence. Conversely, for each f-divergence, there exists a whole set of loss functions $\phi$ for which the correspondence holds. Within the class of convex loss functions and the class of f-divergence measures, one can construct equivalent loss functions and equivalent f-divergence measures, respectively. For the class of classification-calibrated decreasing convex loss functions, we can characterize the correspondence precisely.

As examples of the general correspondence that we establish in this paper, we show that the hinge loss corresponds to the variational distance, the exponential loss corresponds to the Hellinger distance, and the logistic loss corresponds to the capacitory discrimination distance.

Besides the intrinsic interest of these results as an extension of Blackwell's result, and the general cross-fertilization that they permit between results in information theory and results in statistical machine learning, there are several specific consequences of our general theoretical development. First, there are numerous useful inequalities relating the various f-divergences (Topsoe, 2000); our theorem allows these inequalities to be exploited in the analysis of loss functions. Second, the minimizer of the Bayes error and the maximizer of the f-divergence are both known to possess certain extremal properties (Tsitsiklis, 1993a); our theorem allows these properties to be connected. Third, our theorem allows a notion of equivalence to be defined among loss functions: loss functions are equivalent if they induce the same f-divergence. We specifically use the constructive nature of our theorem to exhibit all possible convex loss functions that are equivalent to the 0-1 loss.

To illustrate our general theoretical result, we present an application to the problem of distributed detection. Rather than approaching the problem via the classical route of f-divergences, we instead approach the problem using the tools of statistical machine learning. We obtain a novel algorithmic framework for distributed detection for which we can prove strong convergence results. In particular, exploiting the equivalence alluded to above, we can show that for any surrogate loss function equivalent to the 0-1 loss, our estimation procedure is consistent in the strong sense that it will asymptotically choose Bayes-optimal quantization rules.

The paper is organized as follows. In Section 2, we define a version of discriminant analysis that is suitably general so as to include problems such as distributed detection and signal selection which involve an aspect of experiment design. We also provide a formal definition of surrogate loss functions and present examples of optimized risks based on these loss functions. In Section 3, we state and prove the correspondence theorem between surrogate loss functions and f-divergences. Section 4 illustrates the correspondence using well-known examples of loss functions and their f-divergence counterparts.
In Section 5, we discuss connections between the choice of quantization schemes and Blackwell's classic results on comparisons

of experiments. Then we introduce notions of equivalence between loss functions (and f-divergences) and explore their properties. In Section 6, we establish the consistency of schemes for choosing Bayes-optimal classifiers based on surrogate loss functions that are equivalent to the 0-1 loss. We present our conclusions in Section 7.

2 Background and elementary results

Consider a covariate $X \in \mathcal{X}$, where $\mathcal{X}$ is a compact topological space, and a random variable $Y \in \mathcal{Y} := \{-1, +1\}$. The space $\mathcal{X} \times \mathcal{Y}$ is assumed to be endowed with a Borel regular probability measure $P$. In the classical discrimination (i.e., binary classification) problem, the goal is to find a discriminant function based on i.i.d. samples from $P$. In this paper, we consider an elaboration of this problem in which the decision-maker, rather than having direct access to $X$, observes a variable $Z \in \mathcal{Z}$ that is obtained via a (possibly stochastic) mapping $Q : \mathcal{X} \to \mathcal{Z}$. The mapping $Q$ is referred to as an experiment in statistics; in the signal processing literature, where $\mathcal{Z}$ is generally taken to be discrete, it is referred to as a quantizer. We let $\mathcal{Q}$ denote the space of all stochastic $Q$, and let $\mathcal{Q}_0$ denote its deterministic subset. Given a fixed experiment $Q$, we formulate a binary classification problem as the problem of finding a measurable function $\gamma \in \Gamma := \{\mathcal{Z} \to \mathbb{R}\}$ that minimizes the Bayes risk $P(Y \neq \mathrm{sign}(\gamma(Z)))$. We are also interested in the broader question of determining both the classifier $\gamma \in \Gamma$ and the experiment choice $Q \in \mathcal{Q}$ so as to minimize the Bayes risk.

2.1 Surrogate loss functions

The Bayes risk corresponds to the expectation of the 0-1 loss $\phi(y, \gamma(z)) = \mathbb{I}[y \neq \mathrm{sign}(\gamma(z))]$. Given the nonconvexity of this loss function, it is natural to consider a surrogate loss function $\phi$ that we optimize in place of the 0-1 loss. In particular, we focus on loss functions of the form $\phi(y, \gamma(z)) = \phi(y\gamma(z))$, where $\phi : \mathbb{R} \to \mathbb{R}$ is a convex upper bound on the 0-1 loss. The quantity $y\gamma(z)$ is known as the margin, and $\phi(y\gamma(z))$ is often referred to as a margin-based loss function. Given a particular loss function $\phi$, we denote the associated $\phi$-risk by $R_\phi(\gamma, Q) := \mathbb{E}\,\phi(Y\gamma(Z))$.

The following are examples of loss functions used in the statistical machine learning literature. The hinge loss function is used in the support vector machine (SVM) algorithm (Schölkopf and Smola, 2002):
$$\phi_{\mathrm{hinge}}(y\gamma(z)) := \max\{1 - y\gamma(z),\ 0\}. \tag{1}$$
The logistic loss function is used in logistic regression (Friedman et al., 2000):
$$\phi_{\mathrm{log}}(y\gamma(z)) := \log\big(1 + \exp(-y\gamma(z))\big). \tag{2}$$
The AdaBoost algorithm (Freund and Schapire, 1997) uses an exponential loss function:
$$\phi_{\mathrm{exp}}(y\gamma(z)) := \exp(-y\gamma(z)). \tag{3}$$
Finally, we also consider the least squares function:
$$\phi_{\mathrm{sqr}}(y\gamma(z)) := (1 - y\gamma(z))^2. \tag{4}$$
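As a quick computational illustration (our own sketch, not part of the original report), the following Python snippet tabulates these four losses as functions of the margin and checks the two properties used repeatedly below: convexity, and classification calibration in the sense of Lemma 2 (differentiability at 0 with $\phi'(0) < 0$):

```python
import numpy as np

# The four margin-based surrogate losses of equations (1)-(4), written as
# functions of the margin m = y * gamma(z).
losses = {
    "hinge":         lambda m: np.maximum(1.0 - m, 0.0),  # eq. (1)
    "logistic":      lambda m: np.log1p(np.exp(-m)),      # eq. (2)
    "exponential":   lambda m: np.exp(-m),                # eq. (3)
    "least squares": lambda m: (1.0 - m) ** 2,            # eq. (4)
}

m = np.linspace(-3.0, 3.0, 601)
h = 1e-6
for name, phi in losses.items():
    # Convexity: discrete second differences must be non-negative.
    assert np.all(np.diff(phi(m), 2) >= -1e-9), name
    # Classification calibration (Lemma 2): phi'(0) < 0.
    assert (phi(h) - phi(-h)) / (2 * h) < 0, name
print("all four losses are convex and classification-calibrated")
```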

Bartlett et al. (2005) have provided a general definition of surrogate loss functions. Their definition is crafted so as to permit the derivation of a general bound that links the $\phi$-risk and the Bayes risk, thereby permitting an elegant general treatment of the consistency of estimation procedures based on surrogate losses. The definition is essentially a pointwise form of a Fisher consistency condition that is appropriate for the classification setting. In particular, we have:

Definition 1. A convex loss function $\phi$ is classification-calibrated if for any $a, b \geq 0$ with $a \neq b$,
$$\inf_{\alpha:\ \alpha(a-b) < 0} \big[\phi(\alpha)a + \phi(-\alpha)b\big] > \inf_{\alpha \in \mathbb{R}} \big[\phi(\alpha)a + \phi(-\alpha)b\big].$$

To obtain intuition for this definition, recall the representation of the $\phi$-risk given in equation (7): it implies that, given a fixed $Q$, the optimal $\gamma(z)$ takes a value $\alpha$ that minimizes $\phi(\alpha)\mu(z) + \phi(-\alpha)\pi(z)$. In order for the decision rule $\gamma$ to behave equivalently to the Bayes decision rule, we require that the optimal value of $\alpha$ (which defines $\gamma(z)$) have the same sign as the Bayes decision rule $\mathrm{sign}(P(Y = 1 \mid z) - P(Y = -1 \mid z)) = \mathrm{sign}(\mu(z) - \pi(z))$.

For our purposes we will find it useful to consider a somewhat stronger definition of surrogate loss functions. In particular, we will make the following three assumptions:

A1: $\phi$ is classification-calibrated.

A2: $\phi : \mathbb{R} \to \mathbb{R}$ is continuous and convex.

A3: Let $\alpha^* = \inf\{\alpha :\ \phi(\alpha) = \inf\phi\}$. If $\alpha^* < +\infty$, then for any $\epsilon > 0$,
$$\phi(\alpha^* - \epsilon) \geq \phi(\alpha^* + \epsilon). \tag{5}$$

The interpretation of Assumption A3 is that one should penalize deviations away from $\alpha^*$ in the negative direction at least as strongly as deviations in the positive direction; this requirement is intuitively reasonable given the margin-based interpretation of $\alpha^*$. This assumption is satisfied by all of the loss functions commonly considered in the literature; in particular, any decreasing function $\phi$ satisfies this condition, as does the least squares loss (which is not decreasing).

Bartlett et al. (2005) also presented a simple lemma that we will find useful:

Lemma 2. Let $\phi$ be a convex function. Then $\phi$ is classification-calibrated if and only if it is differentiable at 0 and $\phi'(0) < 0$.

This lemma implies that Assumption A1 is equivalent to requiring that $\phi$ be differentiable at 0 with $\phi'(0) < 0$. These facts also imply that $\alpha^* > 0$, where $\alpha^*$ is defined in Assumption A3. Finally, although $\phi$ is not defined for $-\infty$, we shall use the convention that $\phi(-\infty) = +\infty$.

2.2 Examples of optimum $\phi$-risks

For each fixed quantization rule $Q$, we define the optimal $\phi$-risk (a function of $Q$) as follows:
$$R_\phi(Q) := \inf_{\gamma \in \Gamma} R_\phi(\gamma, Q). \tag{6}$$
Given priors $p = P(Y = 1)$ and $q = P(Y = -1)$ on the hypothesis space (where $p, q > 0$ and $p + q = 1$), we define positive measures $\mu$ and $\pi$ over $\mathcal{Z}$:
$$\mu(z) = P(Y = 1, z) = p\int_x Q_z(x)\, dP(x \mid Y = 1),$$
$$\pi(z) = P(Y = -1, z) = q\int_x Q_z(x)\, dP(x \mid Y = -1).$$

As a consequence of Lyapunov's theorem, the space of pairs $\{(\mu, \pi)\}$ obtained by varying $Q \in \mathcal{Q}$ (or $\mathcal{Q}_0$) is both compact and convex (cf. Tsitsiklis, 1993a). We will find the following lemma to be useful:

Lemma 3. For each fixed fusion decision rule $\gamma$,
$$\inf_{Q \in \mathcal{Q}} R_\phi(\gamma, Q) = \min_{Q \in \mathcal{Q}_0} R_\phi(\gamma, Q).$$

Proof. See Appendix A.

For simplicity, in this paper we assume that the spaces $\mathcal{Q}$ and $\mathcal{Q}_0$ are restricted such that both $\mu$ and $\pi$ are strictly positive measures. Note that the measures $\mu$ and $\pi$ are constrained by the following simple relations:
$$\mu(z) + \pi(z) = P(z) \ \text{for each } z \in \mathcal{Z}, \qquad \sum_{z \in \mathcal{Z}} \mu(z) = P(Y = 1), \qquad \sum_{z \in \mathcal{Z}} \pi(z) = P(Y = -1), \qquad \sum_{z \in \mathcal{Z}} \big[\mu(z) + \pi(z)\big] = 1.$$

Let $\eta(x) = P(Y = 1 \mid x)$. Note that $Y$ and $Z$ are independent conditioned on $X$. Therefore, we can write
$$R_\phi(\gamma, Q) = \mathbb{E}_X \sum_z \Big[\phi(\gamma(z))\,\eta(X)\,Q(z \mid X) + \phi(-\gamma(z))\,(1 - \eta(X))\,Q(z \mid X)\Big]. \tag{7}$$
On the basis of this equation, the $\phi$-risk can be written in the following way:
$$R_\phi(\gamma, Q) = \mathbb{E}\,\phi(Y\gamma(Z)) = \sum_z \Big[\phi(\gamma(z))\,\mathbb{E}_X \eta(X)Q(z \mid X) + \phi(-\gamma(z))\,\mathbb{E}_X (1 - \eta(X))Q(z \mid X)\Big] \tag{8}$$
$$= \sum_z \Big[\phi(\gamma(z))\mu(z) + \phi(-\gamma(z))\pi(z)\Big]. \tag{9}$$
This representation allows us to compute the optimal value of $\gamma(z)$ for all $z \in \mathcal{Z}$, as well as the optimal $\phi$-risk for a fixed $Q$. We illustrate with some examples.

0-1 loss. If $\phi$ is the 0-1 loss, then $\gamma(z) = \mathrm{sign}(\mu(z) - \pi(z))$. As a result, the optimal Bayes risk given a fixed $Q$ takes the form
$$R_{\mathrm{bayes}}(Q) = \sum_{z \in \mathcal{Z}} \min\{\mu(z), \pi(z)\} = \frac{1}{2} - \frac{1}{2}\sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)| = \frac{1}{2}\big(1 - V(\mu, \pi)\big),$$
where $V(\mu, \pi)$ denotes the variational distance between the two measures $\mu$ and $\pi$: $V(\mu, \pi) := \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)|$.

Hinge loss. If $\phi$ is the hinge loss, then again $\gamma(z) = \mathrm{sign}(\mu(z) - \pi(z))$. As a result, the optimal risk for the hinge loss takes the form
$$R_{\mathrm{hinge}}(Q) = \sum_{z \in \mathcal{Z}} 2\min\{\mu(z), \pi(z)\} = 1 - \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)| = 1 - V(\mu, \pi) = 2R_{\mathrm{bayes}}(Q).$$
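These two identities are easy to confirm numerically. The following sketch (ours; the strictly positive measures $\mu, \pi$ are an arbitrary random choice) minimizes the inner expression of equation (9) over $\gamma(z)$ on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5                                  # |Z|, the number of quantizer outputs
w = rng.random(2 * k) + 0.05
w /= w.sum()                           # total mass mu + pi equals 1
mu, pi = w[:k], w[k:]

alphas = np.linspace(-5.0, 5.0, 2001)  # grid over the fusion rule gamma(z)
hinge = lambda a: np.maximum(1.0 - a, 0.0)

# R_phi(Q) = sum_z min_alpha [ phi(alpha) mu(z) + phi(-alpha) pi(z) ]  (eq. 9)
r_hinge = sum((hinge(alphas) * m + hinge(-alphas) * p).min() for m, p in zip(mu, pi))

V = np.abs(mu - pi).sum()              # variational distance V(mu, pi)
r_bayes = np.minimum(mu, pi).sum()     # optimal 0-1 risk

assert np.isclose(r_bayes, 0.5 * (1.0 - V))
assert np.isclose(r_hinge, 1.0 - V)
print(f"R_bayes = {r_bayes:.4f} = (1 - V)/2, and R_hinge = 2 R_bayes")
```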

Least squares loss. If $\phi$ is the least squares loss, then $\gamma(z) = \frac{\mu(z) - \pi(z)}{\mu(z) + \pi(z)}$. The optimal risk for the least squares loss takes the form
$$R_{\mathrm{sqr}}(Q) = \sum_{z \in \mathcal{Z}} \frac{4\mu(z)\pi(z)}{\mu(z) + \pi(z)} = 1 - \sum_{z \in \mathcal{Z}} \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)} = 1 - \Delta(\mu, \pi),$$
where $\Delta(\mu, \pi)$ denotes the triangular discrimination distance: $\Delta(\mu, \pi) := \sum_{z \in \mathcal{Z}} \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)}$.

Logistic loss. If $\phi$ is the logistic loss, then $\gamma(z) = \log\frac{\mu(z)}{\pi(z)}$. As a result, the optimal risk for the logistic loss takes the form
$$R_{\mathrm{log}}(Q) = \sum_{z \in \mathcal{Z}} \Big[\mu(z)\log\frac{\mu(z) + \pi(z)}{\mu(z)} + \pi(z)\log\frac{\mu(z) + \pi(z)}{\pi(z)}\Big] = \log 2 - KL\Big(\mu \,\Big\|\, \frac{\mu + \pi}{2}\Big) - KL\Big(\pi \,\Big\|\, \frac{\mu + \pi}{2}\Big) = \log 2 - C(\mu, \pi),$$
where $KL(U \| V)$ denotes the Kullback-Leibler divergence between two measures $U$ and $V$, and $C(U, V)$ denotes the capacitory discrimination distance: $C(U, V) := KL\big(U \,\|\, \frac{U+V}{2}\big) + KL\big(V \,\|\, \frac{U+V}{2}\big)$.

Exponential loss. If $\phi$ is the exponential loss, then $\gamma(z) = \frac{1}{2}\log\frac{\mu(z)}{\pi(z)}$. The optimal risk for the exponential loss takes the form
$$R_{\mathrm{exp}}(Q) = \sum_{z \in \mathcal{Z}} 2\sqrt{\mu(z)\pi(z)} = 1 - \sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2 = 1 - 2h^2(\mu, \pi),$$
where $h(\mu, \pi)$ denotes the Hellinger distance between the measures $\mu$ and $\pi$: $h^2(\mu, \pi) := \frac{1}{2}\sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2$.

It is noteworthy that in all cases the optimum $\phi$-risk takes the form of a well-known distance or divergence function. It is clearly of interest to investigate the generality of this relationship.
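The remaining three identities can be checked the same way: minimizing expression (9) over $\gamma(z)$ by brute force recovers $1 - \Delta$, $\log 2 - C$, and $1 - 2h^2$. A small sketch (ours), continuing the setup above:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
w = rng.random(2 * k) + 0.05
w /= w.sum()
mu, pi = w[:k], w[k:]

a = np.linspace(-8.0, 8.0, 4001)
risk = lambda phi: sum((phi(a) * m + phi(-a) * p).min() for m, p in zip(mu, pi))

delta = np.sum((mu - pi) ** 2 / (mu + pi))            # triangular discrimination
h2 = 0.5 * np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)   # squared Hellinger distance
kl = lambda u, v: np.sum(u * np.log(u / v))
C = kl(mu, (mu + pi) / 2) + kl(pi, (mu + pi) / 2)     # capacitory discrimination

assert np.isclose(risk(lambda m: (1.0 - m) ** 2), 1.0 - delta, atol=1e-4)
assert np.isclose(risk(lambda m: np.log1p(np.exp(-m))), np.log(2.0) - C, atol=1e-4)
assert np.isclose(risk(lambda m: np.exp(-m)), 1.0 - 2.0 * h2, atol=1e-4)
print("R_sqr = 1 - Delta, R_log = log 2 - C, R_exp = 1 - 2 h^2 verified")
```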

3 The correspondence between surrogate loss functions and distance measures

In fact, the correspondence that we have seen in the examples of the previous section is quite general. The first step in revealing the generality of the connection is to consider an appropriately general notion of distance function. We consider the class of f-divergence functions, a class that includes all of the examples discussed above and numerous others (Csiszár, 1967, Ali and Silvey, 1966):

Definition 4. Given any continuous convex function $f : [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$, the f-divergence between measures $\mu$ and $\pi$ is given by
$$I_f(\mu, \pi) := \sum_z \pi(z)\, f\Big(\frac{\mu(z)}{\pi(z)}\Big).$$
For instance, the variational distance is given by $f(u) = |u - 1|$, the Kullback-Leibler divergence by $f(u) = u\log u$, the triangular discrimination by $f(u) = (u-1)^2/(u+1)$, and the Hellinger distance by $f(u) = \frac{1}{2}(\sqrt{u} - 1)^2$. Other well-known f-divergences include the (negative) Bhattacharyya distance ($f(u) = -2\sqrt{u}$) and the (negative) harmonic distance ($f(u) = -\frac{4u}{u+1}$).

As we have discussed in the introduction, these functions are widely used in the engineering literature to solve problems in distributed detection and signal selection. Specifically, given the joint distribution $P(X, Y)$ and given an experiment (e.g., quantizer) $Q$, one defines an f-divergence on the class-conditional distributions $P(Z \mid Y = 1)$ and $P(Z \mid Y = -1)$. This f-divergence is maximized with respect to $Q$. Moreover, the discriminant function $\gamma$ can generally be obtained explicitly in terms of the distributions $P(Z \mid Y = 1)$ and $P(Z \mid Y = -1)$. As we have discussed, the choice of the class of f-divergences as functions to optimize is motivated by Blackwell's classical theorem on the design of experiments, as well as by the computational intractability of minimizing the probability of error, a problem rendered particularly severe in practice when $X$ is high-dimensional (Kailath, 1967, Poor and Thomas, 1977, Longo et al., 1990).

3.1 From $\phi$-risk to f-divergence

In this section and the following one, we present a general relationship between optimal $\phi$-risks and f-divergences. The easier direction is from $\phi$-risk to f-divergence; we cover this direction in the current section. We begin with a simple result showing that any $\phi$-risk induces a corresponding f-divergence. More precisely, the following lemma proves that the optimal $\phi$-risk for a fixed $Q$ can be written as the negative of an f-divergence between $\mu$ and $\pi$.

Lemma 5. For each fixed $Q$, let $\gamma_Q$ be the optimal decision rule for the fusion center. Then the $\phi$-risk for $(Q, \gamma_Q)$ is the negative of an f-divergence between $\mu$ and $\pi$ for some convex function $f$:
$$R_\phi(Q) = -I_f(\mu, \pi). \tag{10}$$
Moreover, this holds whether $\phi$ is convex or not.

Proof. The optimal $\phi$-risk takes the form
$$R_\phi(Q) = \sum_{z \in \mathcal{Z}} \min_\alpha \big[\phi(\alpha)\mu(z) + \phi(-\alpha)\pi(z)\big] = \sum_z \pi(z)\,\min_\alpha\Big[\phi(-\alpha) + \phi(\alpha)\frac{\mu(z)}{\pi(z)}\Big].$$
For each $z$, let $u = \mu(z)/\pi(z)$; then $\min_\alpha\big(\phi(-\alpha) + \phi(\alpha)u\big)$ is a concave function of $u$ (since the minimum over a set of linear functions is concave). Thus, if we define
$$f(u) := -\min_\alpha\big(\phi(-\alpha) + \phi(\alpha)u\big), \tag{11}$$
then $f$ is convex and the claim follows. Note that this holds regardless of the convexity of $\phi$.

Remark. We can also write $I_f(\mu, \pi)$ in terms of an f-divergence between the two conditional distributions $P(Z \mid Y = 1) \sim P_1$ and $P(Z \mid Y = -1) \sim P_{-1}$. Letting $q = P(Y = -1)$ denote the prior probability, we have
$$I_f(\mu, \pi) = q\sum_z P_{-1}(z)\, f\Big(\frac{(1-q)P_1(z)}{qP_{-1}(z)}\Big) = I_{f_q}(P_1, P_{-1}),$$
where $f_q(u) := qf\big((1-q)u/q\big)$. It is equivalent to study either form of distance. We prefer the former because the prior probabilities are absorbed into the formula, but we shall return to the latter form when the connection to the general theory of comparison of experiments is discussed.
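Equation (11) can also be evaluated numerically: replacing the exact minimization over $\alpha$ by a fine grid recovers the convex functions $f$ listed above. The following is an illustrative sketch (ours), using as reference values the closed forms that are derived in Section 4:

```python
import numpy as np

alphas = np.linspace(-10.0, 10.0, 200001)   # grid stand-in for min over alpha

def f_from_loss(phi, u):
    # f(u) = -min_alpha [ phi(-alpha) + phi(alpha) u ]   (equation 11)
    return -np.min(phi(-alphas) + phi(alphas) * u)

for u in np.linspace(0.05, 5.0, 25):
    # exponential loss -> negative Bhattacharyya distance: f(u) = -2 sqrt(u)
    assert np.isclose(f_from_loss(lambda m: np.exp(-m), u),
                      -2.0 * np.sqrt(u), atol=1e-3)
    # hinge loss -> variational-distance family: f(u) = -2 min(u, 1)
    assert np.isclose(f_from_loss(lambda m: np.maximum(1.0 - m, 0.0), u),
                      -2.0 * min(u, 1.0), atol=1e-3)
    # logistic loss -> capacitory-discrimination family:
    # f(u) = u log u - (1 + u) log(1 + u)
    f_log = u * np.log(u) - (1.0 + u) * np.log(1.0 + u)
    assert np.isclose(f_from_loss(lambda m: np.log1p(np.exp(-m)), u),
                      f_log, atol=1e-3)
print("f recovered from each loss matches its closed form")
```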

3.2 From f-divergence to $\phi$-risk

In this section, we explore the converse of Lemma 5. Given a divergence $I_f(\mu, \pi)$ for some convex function $f$, does there exist a loss function $\phi$ for which $R_\phi(Q) = -I_f(\mu, \pi)$? Can we establish such a correspondence between $f$ and $\phi$, as manifested by equation (11)? In the following we show that such a correspondence indeed exists for a general class of margin-based convex loss functions, and that it is possible to construct $\phi$ given an appropriate corresponding f-divergence.

3.2.1 Some intermediate functions

Our approach to establishing the desired correspondence proceeds via some intermediate functions, which we define in this section. First, let us define, for each $\beta$, the inverse mapping
$$\phi^{-1}(\beta) := \inf\{\alpha : \phi(\alpha) \leq \beta\}, \tag{12}$$
where $\inf\emptyset := +\infty$. The following result summarizes some useful properties of $\phi^{-1}$:

Lemma 6. (a) For all $\beta \in \mathbb{R}$ such that $\phi^{-1}(\beta) < +\infty$, it holds that $\phi(\phi^{-1}(\beta)) \leq \beta$. Furthermore, equality occurs when $\phi$ is continuous at $\phi^{-1}(\beta)$.
(b) $\phi^{-1} : \mathbb{R} \to \mathbb{R}$ is a strictly decreasing convex function.

Proof. See Appendix B.

Using the function $\phi^{-1}$, we then define a new function $\Psi : \mathbb{R} \to \mathbb{R}$ by
$$\Psi(\beta) := \begin{cases} \phi(-\phi^{-1}(\beta)) & \text{if } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty & \text{otherwise.} \end{cases} \tag{13}$$
Note that the domain of $\Psi$ is $\mathrm{Dom}(\Psi) = \{\beta \in \mathbb{R} : \phi^{-1}(\beta) \in \mathbb{R}\}$. Several important facts about $\Psi$ are stated in the following lemma. Define
$$\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\} \quad \text{and} \quad \beta_2 := \inf\{\beta : \Psi(\beta) = \inf\Psi\}. \tag{14}$$
It is simple to check that $\inf\phi = \inf\Psi = \phi(\alpha^*)$, that $\beta_1 = \phi(\alpha^*)$ and $\beta_2 = \phi(-\alpha^*)$, and furthermore that $\Psi(\beta_2) = \phi(\alpha^*) = \beta_1$ and $\Psi(\beta_1) = \phi(-\alpha^*) = \beta_2$.

Lemma 7. (a) $\Psi$ is strictly decreasing in $(\beta_1, \beta_2)$. If $\phi$ is decreasing, then $\Psi$ is also decreasing in $(-\infty, +\infty)$. In addition, $\Psi(\beta) = +\infty$ for $\beta < \beta_1$.
(b) $\Psi$ is convex in $(-\infty, \beta_2]$. If $\phi$ is a decreasing function, then $\Psi$ is convex in $(-\infty, +\infty)$.
(c) $\Psi$ is lower semi-continuous, and continuous in its domain.
(d) For any $\alpha \geq 0$, $\phi(\alpha) = \Psi(\phi(-\alpha))$. In particular, there exists $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$.
(e) There holds $\Psi(\Psi(\beta)) \leq \beta$ for all $\beta \in \mathrm{Dom}(\Psi)$. If $\phi$ is continuous on $\{\alpha : \phi(\alpha) < +\infty\}$, then $\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$.

Proof. See Appendix C.
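It may help to see these definitions in action on a familiar loss; the following worked computation (added here for illustration, and consistent with Example 1 of Section 4) takes $\phi$ to be the exponential loss:
$$\phi(\alpha) = e^{-\alpha}: \qquad \phi^{-1}(\beta) = \inf\{\alpha : e^{-\alpha} \leq \beta\} = -\log\beta \quad (\beta > 0),$$
$$\Psi(\beta) = \phi\big(-\phi^{-1}(\beta)\big) = \phi(\log\beta) = e^{-\log\beta} = \frac{1}{\beta} \quad (\beta > 0),$$
and $\Psi(\beta) = +\infty$ for $\beta \leq 0$, since there $\phi^{-1}(\beta) = \inf\emptyset = +\infty \notin \mathbb{R}$. Here $\alpha^* = +\infty$, $\beta_1 = \inf\phi = 0$, $\beta_2 = \phi(-\infty) = +\infty$, the fixed point of part (d) is $u^* = \phi(0) = 1$, and $\Psi(\Psi(\beta)) = \beta$ holds on $(\beta_1, \beta_2)$, as part (e) asserts.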

Remark. The convexity of $\Psi$ does not hold in general (i.e., when $\phi$ is not a decreasing function). For the least squares loss $\phi(\alpha) = (1 - \alpha)^2$, $\Psi$ is a convex function. For the following loss function, however, $\Psi$ is not convex:
$$\phi(\alpha) = \begin{cases} (1 - \alpha)^2 & \text{when } \alpha \leq 1, \\ 0 & \text{when } 1 \leq \alpha \leq 2, \\ \alpha - 2 & \text{otherwise.} \end{cases}$$
Indeed, we have $\Psi(9) = \phi(2) = 0$ and $\Psi(16) = \phi(3) = 1$, but $\Psi(25/2) = \phi(-1 + 5/\sqrt{2}) = -3 + 5/\sqrt{2} > \big(\Psi(9) + \Psi(16)\big)/2$.

The following lemma characterizes the connection between $\Psi$ and the f-divergence.

Lemma 8. (a) Given a loss function $\phi$, the function $f$ defined in (11) satisfies
$$f(u) = \Psi^*(-u),$$
where $\Psi^*$ denotes the conjugate dual of the function $\Psi$ defined in (13), which satisfies the properties specified in Lemma 7. If, in addition, $\phi$ is decreasing, then $\Psi(\beta) = f^*(-\beta)$.
(b) All such loss functions $\phi$ must share the following form:
$$\phi(0) = u^*, \tag{15a}$$
$$\phi(\alpha) = \Psi\big(g(\alpha + u^*)\big) \quad \text{for } \alpha > 0, \tag{15b}$$
$$\phi(-\alpha) = g(\alpha + u^*) \quad \text{for } \alpha > 0, \tag{15c}$$
where $u^* \in (\beta_1, \beta_2)$ satisfies $\Psi(u^*) = u^*$, and $g : [u^*, +\infty) \to \mathbb{R}$ is any increasing continuous convex function such that $g(u^*) = u^*$. Moreover, $g$ is differentiable at $u^*{+}$ with $g'(u^*{+}) > 0$.
(c) If $\Psi$ is differentiable at $u^*$, then $\Psi'(u^*) = -1$.

Proof. (a) From (11), we have
$$-f(u) = \inf_{\alpha \in \mathbb{R}}\big(\phi(-\alpha) + \phi(\alpha)u\big) = \inf_{\beta:\ \phi^{-1}(\beta) \in \mathbb{R};\ \alpha:\ \phi(\alpha) = \beta}\big(\phi(-\alpha) + \beta u\big).$$
For $\beta$ such that $\phi^{-1}(\beta) \in \mathbb{R}$, there might be more than one $\alpha$ such that $\phi(\alpha) = \beta$. However, assumption (5) ensures that $\alpha = \phi^{-1}(\beta)$ yields the minimum value of $\phi(-\alpha)$. Hence,
$$-f(u) = \inf_{\beta:\ \phi^{-1}(\beta) \in \mathbb{R}}\big(\phi(-\phi^{-1}(\beta)) + \beta u\big) = \inf_{\beta \in \mathbb{R}}\big(\beta u + \Psi(\beta)\big),$$
so that $f(u) = \sup_{\beta \in \mathbb{R}}\big(-\beta u - \Psi(\beta)\big) = \Psi^*(-u)$. If $\phi$ is decreasing, then $\Psi$ is convex. By convex duality and the lower semicontinuity of $\Psi$ (from Lemma 7), we can also write
$$\Psi(\beta) = \Psi^{**}(\beta) = f^*(-\beta). \tag{16}$$
Hence, we can recover the function $\Psi$ once we know $f$. In the sequel, we show how to recover the loss function $\phi$ from $\Psi$.

(b) We now show that all convex loss functions $\phi$ must have the form (15) for some $g$. We have shown in Lemma 7 that $\Psi(\phi(0)) = \phi(0) \in (\beta_1, \beta_2)$. Hence, $\phi(0)$ takes the value $u^* \in (\beta_1, \beta_2)$ for which $\Psi(u^*) = u^*$. Since $\phi$ is a decreasing convex function on $(-\infty, 0]$, for any $\alpha \geq 0$ the quantity $\phi(-\alpha)$ can be written in the form $\phi(-\alpha) = g(\alpha + u^*)$, where $g$ is some increasing convex function. Invoking Lemma 7 again, for $\alpha \geq 0$, $\phi(\alpha) = \Psi(\phi(-\alpha)) = \Psi(g(\alpha + u^*))$. To ensure continuity at 0, there must hold $u^* = \phi(0) = g(u^*)$. To ensure that $\phi$ is classification-calibrated, $\phi$ must be differentiable at 0 with $\phi'(0) < 0$; this implies that $g$ is differentiable at $u^*$ with $g'(u^*) > 0$.

(c) If $\Psi$ is differentiable at $u^*$, then at $\alpha = 0$ we can write
$$\phi'(0+) = \Psi'(u^*)\,g'(u^*) = \phi'(0-) = -g'(u^*) < 0,$$
which implies that $\Psi'(u^*) = -1$.

According to Lemma 8(a), if $\Psi$ is a lower semicontinuous convex function, it is possible to recover $\Psi$ from $f$ by means of convex duality (Rockafellar, 1970): $\Psi(\beta) = f^*(-\beta)$. Equation (13) then provides a means of recovering a loss function $\phi$ from $\Psi$. Indeed, the following theorem provides a constructive procedure for finding all such $\phi$ when $\Psi$ satisfies the necessary conditions specified in Lemma 7.

3.2.2 A converse theorem

Theorem 9. Given a lower semicontinuous convex function $f : \mathbb{R} \to \mathbb{R}$, define $\Psi(\beta) = f^*(-\beta)$, and recall that $\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\}$ and $\beta_2 := \inf\{\beta : \Psi(\beta) = \inf\Psi\}$. Suppose that $\Psi$ satisfies the following properties:
1. $\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$;
2. $\Psi$ is a decreasing function.
Then we can construct all convex continuous loss functions $\phi$ for which (10) and (11) hold; in fact, all such functions $\phi$ are of the form (15). If, in addition, $\Psi$ is differentiable at the point $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$ (such a point can be proven to exist), then all such $\phi$ are classification-calibrated.

Proof. By convex duality and the lower semicontinuity of $f$, we have
$$f(u) = f^{**}(u) = \Psi^*(-u) = \sup_{\beta \in \mathbb{R}}\big(-\beta u - \Psi(\beta)\big) = -\inf_{\beta \in \mathbb{R}}\big(\beta u + \Psi(\beta)\big).$$
Lemma 8 asserts that all convex loss functions $\phi$ for which (10) and (11) hold must have the form (15). $\Psi$ is lower semicontinuous and convex by definition. It remains to show that any convex loss function $\phi$ of the form (15) satisfies
$$\Psi(\beta) = \begin{cases} \phi(-\phi^{-1}(\beta)) & \text{when } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty & \text{otherwise.} \end{cases} \tag{17}$$
Since $\Psi$ is assumed to be a decreasing function, $\phi$ so defined is also a decreasing function. Given that $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in (\beta_1, \beta_2)$, it is simple to check that there exists $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$. For $\beta \geq u^*$, there exists $\alpha \geq 0$ such that $g(\alpha + u^*) = \beta$; choose the largest such $\alpha$. From our definition of $\phi$, $\phi(-\alpha) = \beta$, so $\phi^{-1}(\beta) = -\alpha$. It follows that $\phi(-\phi^{-1}(\beta)) = \phi(\alpha) = \Psi(g(\alpha + u^*)) = \Psi(\beta)$.

For $\beta < \beta_1 = \inf\Psi$, we have $\Psi(\beta) = +\infty$. Now consider $\beta_1 \leq \beta < u^*$ ($< \beta_2$). If there exists some $\alpha > 0$ such that $\beta = \Psi(g(\alpha + u^*))$ with $g(\alpha + u^*) \in (\beta_1, \beta_2)$, then $\beta = \phi(\alpha)$ from our definition. Choose the smallest such $\alpha$; then $\phi^{-1}(\beta) = \alpha$. It follows that $\phi(-\phi^{-1}(\beta)) = \phi(-\alpha) = g(\alpha + u^*) = \Psi\big(\Psi(g(\alpha + u^*))\big) = \Psi(\beta)$ (because $g(\alpha + u^*) \in (\beta_1, \beta_2)$). If, on the other hand, no such $\alpha$ exists, then $\beta < \inf\Psi = \inf\phi$ (thanks to the construction of $\phi$), in which case $\Psi(\beta) = +\infty$ by assumption (see Lemma 7).

Since $\Psi$ is assumed to be a lower semicontinuous convex function that is continuous in its domain, and $g$ is chosen to be increasing, continuous and convex in its domain, the function $\phi$ so defined is also continuous and convex. Finally, we need to check that $\phi$ is classification-calibrated. If $\Psi$ is differentiable at $u^*$ and $\Psi(\Psi(\beta)) = \beta$ for $\beta \in (\beta_1, \beta_2)$, it is simple to verify that $\Psi'(u^*) = -1$. As a result, by choosing $g$ to be differentiable at $u^*$ with $g'(u^*) > 0$, the function $\phi$ is differentiable at 0 with $\phi'(0) < 0$. Hence, $\phi$ is classification-calibrated.

One interesting consequence of Theorem 9 is that any realizable f-divergence can in fact be obtained from a fairly large set of loss functions $\phi$. More precisely, examining the construction reveals that for $\alpha \leq 0$ we are free to choose a function $g$ satisfying only mild conditions; given a choice of $g$, the loss $\phi$ is then specified for $\alpha > 0$ accordingly by equation (15).

Corollary 10. Assume that $\phi$ is a decreasing (continuous convex) loss function corresponding to an f-divergence, where $f$ is a continuous convex function that is bounded from below by an affine function. Then $\phi$ is unbounded from below if and only if $f$ is 1-coercive, i.e., $f(x)/|x| \to +\infty$ as $|x| \to \infty$.

Proof. $\phi$ is unbounded from below if and only if $\Psi(\beta) = \phi(-\phi^{-1}(\beta)) \in \mathbb{R}$ for all $\beta \in \mathbb{R}$, which is equivalent to the function $f(\beta) = \Psi^*(-\beta)$ being 1-coercive (cf. Hiriart-Urruty and Lemaréchal, 2001).

Most loss functions considered in machine learning and statistics are bounded from below (e.g., $\phi(\alpha) \geq 0$ for all $\alpha \in \mathbb{R}$). For such loss functions that are also decreasing, $f$ is not 1-coercive. However, there are interesting f-divergences, such as the symmetric KL divergence considered by Bradt and Karlin (1956), for which $f$ is 1-coercive. Examples of f-divergences and their corresponding loss functions are considered in the next section.

While the above results characterize the conditions for an f-divergence to be realized by some surrogate loss $\phi$ directly in terms of the function $f$ (and its convex conjugate, via $\Psi$), the following corollary states conditions in terms of the f-divergence per se that are sometimes simpler to check. To begin, let us call an f-divergence symmetric if $I_f(\mu, \pi) = I_f(\pi, \mu)$ for any measures $\mu$ and $\pi$.

Corollary 11. (a) If $\phi$ induces an f-divergence $I_f$ by way of Lemma 5, then the f-divergence is symmetric.
(b) If $I_f$ is symmetric and $f(u) = +\infty$ for $u < 0$, then $f$ is realizable by some decreasing surrogate loss function $\phi$.

Note that these sufficient conditions cover most surrogate loss functions found in practice; they could be loosened with further effort.

Proof. (Sketch.) (a) We have seen in Lemma 5 that $R_\phi(Q) = -I_f(\mu, \pi)$. Alternatively, we can also write
$$R_\phi(Q) = \sum_z \mu(z)\min_\alpha\Big[\phi(\alpha) + \phi(-\alpha)\frac{\pi(z)}{\mu(z)}\Big] = -\sum_z \mu(z)\, f\Big(\frac{\pi(z)}{\mu(z)}\Big) = -I_f(\pi, \mu).$$
(b) We have
$$I_f(\mu, \pi) = \sum_z \pi(z)f\big(\mu(z)/\pi(z)\big) = \sum_z \pi(z)\sup_v\big(v\mu(z)/\pi(z) - f^*(v)\big) = \sum_z \sup_v\big(v\mu(z) - f^*(v)\pi(z)\big) = \sum_z \big[v_1(z)\mu(z) - f^*(v_1(z))\pi(z)\big]$$
for some $v_1(z) \in \partial f\big(\mu(z)/\pi(z)\big)$.

Similarly, $I_f(\pi, \mu) = \sum_z \big[v_2(z)\pi(z) - f^*(v_2(z))\mu(z)\big]$ for some $v_2(z) \in \partial f\big(\pi(z)/\mu(z)\big)$. Since $I_f$ is symmetric, it follows that $v_1(z) = -f^*(v_2(z))$ and $v_2(z) = -f^*(v_1(z))$ for any $v_1(z) \in \partial f(\mu(z)/\pi(z))$ and $v_2(z) \in \partial f(\pi(z)/\mu(z))$. For simplicity of notation, replace $v_1(z)$ and $v_2(z)$ by $v_1$ and $v_2$, respectively. By varying the ratio $u = \mu(z)/\pi(z)$ we can establish a mapping between $v_1$ and $v_2$ that satisfies $v_2 = -f^*(v_1)$ and $v_1 = -f^*(v_2)$ for any $v_1 \in \partial f(u)$ and $v_2 \in \partial f(1/u)$ with $u > 0$. Recall the function $\Psi(\beta) = f^*(-\beta)$. Then, by definition, $\Psi(\Psi(-v_1)) = \Psi(f^*(v_1)) = \Psi(-v_2) = f^*(v_2) = -v_1$ for any $v_1 \in \partial f(u)$ with $u > 0$. This implies that $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \{-\partial f(u) : u > 0\}$; it follows that $\Psi(\Psi(\beta)) = \beta$ for $\beta \in (\beta_1, \beta_2)$. If $f(u) = +\infty$ for $u < 0$, then we can deduce that $\Psi$ is a decreasing function. Hence $I_f$ is realizable by some surrogate loss function by Theorem 9.

Examples of f-divergences that are not realizable by any margin-based surrogate loss (because they fail to be symmetric) include the f-divergences with $f(u) = -u^s$, where $0 < s < 1$ and $s \neq 1/2$; the chi-squared distance, which corresponds to $f(u) = (u - 1)^2$; as well as the Kullback-Leibler divergences $KL(\mu \| \pi)$ and $KL(\pi \| \mu)$, which correspond to $f(u) = u\log u$ and $f(u) = -\log u$, respectively.

4 Examples of loss functions and f-divergence measures

It is simple to check that if $f_1$ and $f_2$ are related by $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$, then $I_{f_1}(\mu, \pi) = cI_{f_2}(\mu, \pi) + aP(Y = 1) + bP(Y = -1)$, implying that maximizing $I_{f_1}$ and maximizing $I_{f_2}$ over $Q$ are equivalent. Hence, in the following, we consider such $f_1$-divergences and $f_2$-divergences as equivalent. The notion of equivalence between distance measures will be considered more formally and in more detail in the next section.

Example 1 (Hellinger distance, negative Bhattacharyya distance). The Hellinger distance is equivalent to the negative of the Bhattacharyya distance, which is an f-divergence with $f(u) = -2\sqrt{u}$. Augment the domain of $f$ with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$:
$$\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} 1/\beta & \text{when } \beta > 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = 1$. Letting $g(u) = u$, a possible loss function is given by $\phi(0) = 1$ and, for $\alpha > 0$,
$$\phi(\alpha) = \frac{1}{\alpha + 1}, \qquad \phi(-\alpha) = \alpha + 1.$$
Letting $g(u) = e^{u-1}$ instead, we obtain the exponential loss $\phi(\alpha) = \exp(-\alpha)$, agreeing with what was shown in the previous section.

Example 2 (Variational distance). In the previous section, we showed that the f-divergence arising from both the hinge loss and the 0-1 loss is based on $f(u) = -2\min(u, 1)$ for $u \geq 0$, which is equivalent to the variational distance. Augment $f$ with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$:
$$\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} 0 & \text{when } \beta > 2, \\ 2 - \beta & \text{when } 0 \leq \beta \leq 2, \\ +\infty & \text{when } \beta < 0. \end{cases}$$
Clearly $u^* = 1$. Choosing $g(u) = u$, we recover the hinge loss $\phi(\alpha) = (1 - \alpha)_+$. Choosing $g(u) = e^{u-1}$, we obtain the loss
$$\phi(\alpha) = (2 - e^{\alpha})_+ \ \text{ for } \alpha \geq 0, \qquad \phi(\alpha) = e^{-\alpha} \ \text{ for } \alpha < 0.$$
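The two losses just constructed in Example 2 look quite different, yet by construction they should induce the same function $f(u) = -2\min(u, 1)$ through equation (11). The following numerical sketch (ours) confirms this, illustrating the many-to-one nature of the correspondence depicted in Figure 1:

```python
import numpy as np

def hinge(m):
    return np.maximum(1.0 - m, 0.0)

def exp_variant(m):
    # The Example 2 loss built from g(u) = exp(u - 1):
    # phi(alpha) = (2 - e^alpha)_+ for alpha >= 0, and e^(-alpha) for alpha < 0.
    return np.where(m >= 0, np.maximum(2.0 - np.exp(m), 0.0), np.exp(-m))

alphas = np.linspace(-12.0, 12.0, 240001)

def induced_f(phi, u):
    # f(u) = -min_alpha [ phi(-alpha) + phi(alpha) u ]   (equation 11)
    return -np.min(phi(-alphas) + phi(alphas) * u)

for u in np.linspace(0.1, 4.0, 14):
    target = -2.0 * min(u, 1.0)
    assert np.isclose(induced_f(hinge, u), target, atol=1e-3)
    assert np.isclose(induced_f(exp_variant, u), target, atol=1e-3)
print("two different losses, one induced f-divergence: f(u) = -2 min(u, 1)")
```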

Example 3 (Capacitory discrimination distance). The capacitory discrimination distance is equivalent to an f-divergence with $f(u) = u\log u - (1 + u)\log(u + 1)$, defined for $u \geq 0$. Augment this function with $f(u) = +\infty$ for $u < 0$. Recovering $\Psi$ from $f$, we have
$$\Psi(\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} \beta - \log(e^{\beta} - 1) & \text{for } \beta > 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = \log 2$. Choosing $g(u) = \log\big(1 + \frac{e^u}{2}\big)$ gives the logistic loss $\phi(\alpha) = \log(1 + e^{-\alpha})$.

Example 4 (Triangular discrimination distance). The triangular discrimination distance is equivalent to the negative of the harmonic distance, i.e., the f-divergence with $f(u) = -\frac{4u}{u+1}$ for $u \geq 0$. Augment $f$ with $f(u) = +\infty$ for $u < 0$. Then
$$\Psi(\beta) = \sup_{u \in \mathbb{R}}\big(-\beta u - f(u)\big) = \begin{cases} (2 - \sqrt{\beta})^2 & \text{for } \beta \geq 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Clearly $u^* = 1$. Choosing $g(u) = u^2$ gives the least squares loss $\phi(\alpha) = (1 - \alpha)^2$.

Example 5 (Another Kullback-Leibler based distance). We have shown previously that, up to a constant, the logistic loss corresponds to the capacitory discrimination distance $C(\mu, \pi) = KL\big(\mu \,\|\, \frac{\mu+\pi}{2}\big) + KL\big(\pi \,\|\, \frac{\mu+\pi}{2}\big)$. While the KL divergences themselves (both $KL(\mu \| \pi)$ and $KL(\pi \| \mu)$) are not realizable by any margin-based loss function, due to their asymmetry, let us consider the symmetric Kullback-Leibler distance
$$KL_s(\mu, \pi) = KL(\mu \| \pi) + KL(\pi \| \mu),$$
which is an f-divergence with $f(u) = -\log u + u\log u$ for $u \geq 0$, and $+\infty$ otherwise. Observe that
$$\Psi(\beta) = \sup_{u \geq 0}\big(-\beta u + \log u - u\log u\big).$$
Taking the derivative with respect to $u$ of the expression inside the supremum and setting it to zero gives $-\beta + 1/u - \log u - 1 = 0$. Define the function $r : [0, +\infty) \to [-\infty, +\infty]$ by $r(u) = 1/u - \log u$; it is easy to see that $r$ is a strictly decreasing function whose range covers the whole real line. Hence,
$$\Psi(\beta) = u + \log u - 1, \ \text{ where } \beta + 1 = r(u); \quad \text{equivalently,} \quad \Psi(\beta) = r(1/u) - 1 = r\Big(\frac{1}{r^{-1}(\beta + 1)}\Big) - 1.$$
It is simple to check that $\Psi$ is a strictly decreasing convex function with $\Psi(0) = 0$ and $\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \mathbb{R}$. By Lemma 8 and Theorem 9, we can construct all corresponding convex loss functions, which turn out to be strictly decreasing and of the form (15):
$$\phi(\alpha) = \begin{cases} g(-\alpha) & \text{for } \alpha \leq 0, \\ \Psi(g(\alpha)) & \text{otherwise,} \end{cases}$$
where $g : [0, +\infty) \to [0, +\infty)$ is any increasing convex function satisfying $g(0) = 0$. For a choice of $g$ that results in a closed form for $\phi$, let $g(u) = e^u + u - 1$. Then we obtain a valid loss function:
$$\phi(\alpha) = e^{-\alpha} - \alpha - 1.$$

5 On comparison of surrogate loss functions and quantization schemes

The previous section was devoted to the study of the correspondence between f-divergences and the optimal $\phi$-risk $R_\phi(Q)$ for a fixed experiment $Q$. Our ultimate goal, however, is that of choosing an optimal $Q$, which is a problem of experimental design (Blackwell, 1953). In the remainder of the paper, we address the experiment design problem via joint optimization of the $\phi$-risk (or more precisely, its empirical version) over both the decision rule $\gamma$ and the choice of experiment $Q$ (hereafter referred to as a quantizer). This procedure raises a natural theoretical question: for which loss functions $\phi$ does such joint optimization lead to minimum Bayes risk? Note that the minimum here is taken over both the decision rule $\gamma$ and the space of experiments $\mathcal{Q}$, so that this question is not covered by standard consistency results (Zhang, 2004, Steinwart, 2005, Bartlett et al., 2005). To this end, we consider the comparison of loss functions and the comparison of quantization schemes. In Section 6, we describe how the results developed herein can be leveraged to resolve the issue of consistency when learning an optimal quantizer design from empirical data.

5.1 Inequalities relating surrogate loss functions and f-divergences

The correspondence between surrogate loss functions and f-divergences allows one to compare surrogate $\phi$-risks by comparing the corresponding f-divergences, and vice versa. For instance, since the optimal $\phi$-risk for the hinge loss is equivalent to the optimal $\phi$-risk for the 0-1 loss, we can say affirmatively that minimizing the risk for the hinge loss is equivalent to minimizing the Bayes risk. One particularly well-studied connection between f-divergences in the literature is the set of inequalities among divergence measures, some of which are stated in the following lemma.

Lemma 12. (a) $V^2 \leq \Delta \leq V$.
(b) $2h^2 \leq \Delta \leq 4h^2$. As a result, $\frac{1}{2}V^2 \leq 2h^2 \leq V$.
(c) $\frac{1}{2}\Delta \leq C \leq \log 2 \cdot \Delta$. As a result, $\frac{1}{2}V^2 \leq C \leq \log 2 \cdot V$.

Proof. (a) That $\Delta \leq V$ is trivial. The first inequality can be derived by an application of the Cauchy-Schwarz inequality:
$$V^2(\mu, \pi) = \bigg(\sum_z \frac{|\mu(z) - \pi(z)|}{\sqrt{\mu(z) + \pi(z)}}\cdot\sqrt{\mu(z) + \pi(z)}\bigg)^2 \leq \sum_z \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)}\cdot\sum_z\big(\mu(z) + \pi(z)\big) = \Delta(\mu, \pi).$$
(b) Note that for any $z \in \mathcal{Z}$, we have
$$1 \leq \frac{\big(\sqrt{\mu(z)} + \sqrt{\pi(z)}\big)^2}{\mu(z) + \pi(z)} \leq 2.$$
Applying these bounds in the expression
$$\Delta(\mu, \pi) = \sum_{z \in \mathcal{Z}} \frac{\big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2\big(\sqrt{\mu(z)} + \sqrt{\pi(z)}\big)^2}{\mu(z) + \pi(z)}$$
yields $2h^2 \leq \Delta \leq 4h^2$.
(c) See Topsoe (2000) for a proof.

It is straightforward to derive the following connections between the different risks.

Lemma 13. (a) $R_{\mathrm{hinge}}(Q) = 2R_{\mathrm{bayes}}(Q)$.
(b) $2R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{sqr}}(Q) \leq 1 - \big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.
(c) $2\log 2 \cdot R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{log}}(Q) \leq \log 2 - \frac{1}{2}\big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.
(d) $2R_{\mathrm{bayes}}(Q) \leq R_{\mathrm{exp}}(Q) \leq 1 - \frac{1}{2}\big(1 - 2R_{\mathrm{bayes}}(Q)\big)^2$.

According to the above lemmas, all of the distance measures considered are bounded above and below by constant multiples of the variational distance. To obtain the optimal quantization scheme $Q$, note that we want to maximize (rather than minimize) the corresponding f-divergence. Except for the hinge loss, these lemmas do not tell us whether minimizing the $\phi$-risk leads to a classifier $\gamma$ and a quantization rule $Q$ with minimal Bayes risk. In the sequel we discuss this issue in more detail: specifically, we wish to find all losses $\phi$ such that minimizing the $\phi$-risk leads to the same optimal decision rule $(Q, \gamma)$ as minimizing the Bayes risk.
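Both lemmas are straightforward to stress-test numerically. The sketch below (ours) draws random strictly positive measures with $\sum_z [\mu(z) + \pi(z)] = 1$, computes the risks via the identities of Section 2.2, and checks every stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
kl = lambda u, v: np.sum(u * np.log(u / v))
eps = 1e-9

for _ in range(1000):
    w = rng.random(6) + 1e-3
    w /= w.sum()
    mu, pi = w[:3], w[3:]

    V = np.abs(mu - pi).sum()
    Delta = np.sum((mu - pi) ** 2 / (mu + pi))
    h2 = 0.5 * np.sum((np.sqrt(mu) - np.sqrt(pi)) ** 2)
    C = kl(mu, (mu + pi) / 2) + kl(pi, (mu + pi) / 2)

    assert V**2 - eps <= Delta <= V + eps                     # Lemma 12(a)
    assert 2 * h2 - eps <= Delta <= 4 * h2 + eps              # Lemma 12(b)
    assert 0.5 * Delta - eps <= C <= np.log(2) * Delta + eps  # Lemma 12(c)

    # Risks via the Section 2.2 identities.
    r_bayes = np.minimum(mu, pi).sum()
    r_sqr, r_log, r_exp = 1 - Delta, np.log(2) - C, 1 - 2 * h2
    b = 1 - 2 * r_bayes
    assert 2 * r_bayes - eps <= r_sqr <= 1 - b**2 + eps                          # 13(b)
    assert 2 * np.log(2) * r_bayes - eps <= r_log <= np.log(2) - b**2 / 2 + eps  # 13(c)
    assert 2 * r_bayes - eps <= r_exp <= 1 - b**2 / 2 + eps                      # 13(d)
print("Lemma 12 and Lemma 13 bounds hold on 1000 random (mu, pi) pairs")
```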

5.2 Connection between 0-1 loss and f-divergences

The connection between f-divergences and the 0-1 loss can be traced back to seminal work on the comparison of experiments, pioneered by Blackwell and others (Blackwell, 1951, 1953, Bradt and Karlin, 1956).

Definition 14. $Q_1$ is better than $Q_2$ if $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for any prior probability $q = P(Y = -1) \in (0, 1)$.

Recall that a choice of quantization scheme $Q$ induces two conditional distributions $P(Z \mid Y = 1) \sim P_1$ and $P(Z \mid Y = -1) \sim P_{-1}$. Hence, we shall write $P_1^Q$ and $P_{-1}^Q$ to denote the fact that both $P_1$ and $P_{-1}$ are determined by the specific choice of $Q$. By parameterizing the decision-theoretic criterion in terms of the loss function $\phi$ and establishing a precise correspondence between $\phi$ and the f-divergence, it is simple to derive the following theorem, which relates the 0-1 loss and f-divergences in a powerful way.

Theorem 15 (Blackwell, 1951, 1953). For two quantization schemes $Q_1$ and $Q_2$, the following statements are equivalent:
1. $Q_1$ is better than $Q_2$ (i.e., $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for any prior probability $q \in (0, 1)$);
2. $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \geq I_f(P_1^{Q_2}, P_{-1}^{Q_2})$ for all functions of the form $f(u) = -\min(u, c)$ with $c > 0$;
3. $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \geq I_f(P_1^{Q_2}, P_{-1}^{Q_2})$ for all convex functions $f$.

Proof. $1 \Rightarrow 2$: By the correspondence between the 0-1 loss and the f-divergence with $f(u) = -\min(u, 1)$, and the remark following Lemma 5, we have $R_{\mathrm{bayes}}(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$, where $f_q(u) := qf\big(\frac{1-q}{q}u\big) = -(1-q)\min\big(u, \frac{q}{1-q}\big)$. Hence 2 follows by letting $q$ range over $(0, 1)$.
$2 \Rightarrow 1$: Trivial.
$2 \Rightarrow 3$: Any convex function $f(u)$ can be uniformly approximated as the sum of a linear function and a combination $-\sum_k \alpha_k \min(u, c_k)$, where $\alpha_k > 0$ and $c_k > 0$ for all $k$. For a linear function $f$, the divergence $I_f(P_1, P_{-1})$ does not depend on $(P_1, P_{-1})$. Hence, 3 follows easily from 2.

Corollary 16. $Q_1$ is better than $Q_2$ if and only if $R_\phi(Q_1) \leq R_\phi(Q_2)$ for any loss function $\phi$.

Proof. By Lemma 5, $R_\phi(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$. The corollary is then immediate from the above theorem.

One implication of Corollary 16 is that if $R_\phi(Q_1) \leq R_\phi(Q_2)$ for some loss function $\phi$, then $R_{\mathrm{bayes}}(Q_1) \leq R_{\mathrm{bayes}}(Q_2)$ for some prior probability on the hypotheses $P(Y)$. This justifies using a surrogate loss function $\phi$ in place of the 0-1 loss for certain prior probabilities. However, it is in general very difficult to know which prior probability corresponds to a given surrogate loss $\phi$. In many applications, the prior probabilities on the hypotheses are fixed, and the optimum quantization scheme $Q$ is sought for a fixed region of priors; the notion of "better" is then limited in its usefulness. We are thus motivated to consider, in the following subsection, a different method of determining which loss functions (or equivalently, which f-divergences) lead to the same optimal experimental design as the 0-1 loss (respectively, the variational distance).

5.3 Comparison of surrogate loss functions (f-divergence measures)

In the following definition, $\phi_1$ and $\phi_2$ correspond to f-divergences with convex functions $f_1$ and $f_2$, respectively.

Definition 17. 1. Given a probability distribution $P(X, Y)$, $\phi_1$ is P-equivalent to $\phi_2$ with respect to $P(X, Y)$, denoted $\phi_1 \stackrel{P}{\approx} \phi_2$, if for any quantization rules $Q_1, Q_2$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{P}{\approx} f_2$.
2. Given two hypotheses $P(X \mid Y = 1)$ and $P(X \mid Y = -1)$, $\phi_1$ is H-equivalent to $\phi_2$ with respect to $P(X \mid Y = \pm 1)$, denoted $\phi_1 \stackrel{H}{\approx} \phi_2$, if for any quantization rules $Q_1, Q_2$ and any prior probability $q = P(Y = -1)$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{H}{\approx} f_2$.
3. $\phi_1$ and $\phi_2$ are universally equivalent, denoted $\phi_1 \stackrel{u}{\approx} \phi_2$, if for any $P(X, Y)$ and any quantization rules $Q_1, Q_2$, there holds
$$R_{\phi_1}(Q_1) \leq R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \leq R_{\phi_2}(Q_2).$$
Alternatively, we also say $f_1 \stackrel{u}{\approx} f_2$.

Remarks. 1. Clearly, universal equivalence implies H-equivalence, which in turn implies P-equivalence. The notions of P-equivalence and H-equivalence are most useful for a particular problem at hand, when knowledge of the hypotheses $P(X \mid Y = \pm 1)$ and/or the prior probability $q = P(Y = -1)$ is available. Nonetheless, in many cases it seems difficult to determine these types of equivalence even when the underlying $P(X, Y)$ is known. The notion of universal equivalence is most useful when $P(X, Y)$ is not known and is accessible only through empirical data (as in Nguyen et al., 2005).
2. The notions of equivalence defined above ensure that minimizing the $\phi_1$-risk $R_{\phi_1}(Q)$ yields an optimal $Q$ that is also optimal for minimizing $R_{\phi_2}(Q)$. It is worth noting that this behavior holds even for algorithms that optimize the $\phi$-risk over $(Q, \gamma)$ using local methods.

In the following, we characterize necessary and sufficient conditions for universal equivalence. The following fact is immediate.

Lemma 18. If $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$, then $f_1 \stackrel{u}{\approx} f_2$.

Proof. Note that $I_{f_1}(\mu, \pi) = cI_{f_2}(\mu, \pi) + a(1 - q) + bq$, implying $f_1 \stackrel{u}{\approx} f_2$.

The following necessary condition is very useful.

Lemma 19. Given a continuous convex function $f : \mathbb{R}_+ \to \mathbb{R}$, define, for any $u, v \in \mathbb{R}_+$ such that $\partial f(u) \cap \partial f(v) = \emptyset$, the set
$$T_f(u, v) := \bigg\{\frac{u\alpha - v\beta - f(u) + f(v)}{\alpha - \beta} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta} \ :\ \alpha \in \partial f(u),\ \beta \in \partial f(v),\ \alpha \neq \beta\bigg\}.$$
If $f_1 \stackrel{u}{\approx} f_2$, then for any $u, v \in \mathbb{R}_+$, one of the following must be true:
1. $T_f(u, v)$ is well-defined for both $f_1$ and $f_2$, and $T_{f_1}(u, v) \cap T_{f_2}(u, v) \neq \emptyset$;
2. both $f_1$ and $f_2$ are linear on $[u, v]$.

Note that if $f$ is differentiable at $u$ and $v$ with $f'(u) \neq f'(v)$, then $T_f(u, v)$ reduces to the single number
$$\frac{uf'(u) - vf'(v) - f(u) + f(v)}{f'(u) - f'(v)} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta}, \quad \text{where } \alpha = f'(u),\ \beta = f'(v),$$
and $f^*$ denotes the conjugate dual of $f$.

Proof. Consider a distribution $P(X, Y)$ with $P(Y = -1) = q = 1 - P(Y = 1)$, $P(X \mid Y = 1) \sim \mathrm{Uniform}[a, c]$ and $P(X \mid Y = -1) \sim \mathrm{Uniform}[0, b]$, where $0 < a < b < c$. Let $Z \in \{1, 2\}$ be the quantized version of $X$, and consider the family of deterministic quantization schemes $Q(t)$, parameterized by $t \in (a, b)$, such that $Q(z = 1 \mid x) = 1$ when $x \leq t$, and $Q(z = 2 \mid x) = 1$ when $x > t$. Then we have
$$\mu(1) = (1-q)\frac{t - a}{c - a}, \quad \mu(2) = (1-q)\frac{c - t}{c - a}, \quad \pi(1) = q\,\frac{t}{b}, \quad \pi(2) = q\,\frac{b - t}{b}.$$
Therefore, the f-divergence between $\mu$ and $\pi$ for the quantization scheme $Q(t)$ has the form
$$I_f(\mu, \pi) = \frac{qt}{b}\, f\Big(\frac{(t - a)b(1 - q)}{(c - a)tq}\Big) + \frac{q(b - t)}{b}\, f\Big(\frac{(c - t)b(1 - q)}{(c - a)(b - t)q}\Big). \tag{18}$$

If $f_1 \stackrel{u}{\approx} f_2$, then $I_{f_1}(\mu, \pi)$ and $I_{f_2}(\mu, \pi)$ must have the same monotonicity property everywhere, for any parameters $q$ and $a < b < c$. Let $\gamma = \frac{b(1-q)}{(c-a)q}$, which can be made an arbitrary positive number. Define
$$F(f, t) = t\, f\Big(\frac{(t-a)\gamma}{t}\Big) + (b - t)\, f\Big(\frac{(c-t)\gamma}{b-t}\Big).$$
Then $F(f_1, t)$ and $F(f_2, t)$ must have the same monotonicity property, for any parameters $\gamma$ and $a < b < c$. Due to the convexity of the f-divergence with respect to $(\mu, \pi)$, we can deduce that $F(f, t)$ is a convex function of $t \in (a, b)$. Hence,
$$0 \in \partial F(f_1, t) \iff 0 \in \partial F(f_2, t). \tag{19}$$
From standard subdifferential calculus (e.g., Hiriart-Urruty and Lemaréchal, 2001), we have
$$\partial F(f, t) = f\Big(\frac{(t-a)\gamma}{t}\Big) + \frac{a\gamma}{t}\,\partial f\Big(\frac{(t-a)\gamma}{t}\Big) - f\Big(\frac{(c-t)\gamma}{b-t}\Big) + \frac{(c-b)\gamma}{b-t}\,\partial f\Big(\frac{(c-t)\gamma}{b-t}\Big).$$
Let $u = \frac{(t-a)\gamma}{t}$ and $v = \frac{(c-t)\gamma}{b-t}$. Note that we can choose arbitrary $u$, $v$, $\gamma$, and then choose $a, b, c$ accordingly. Then:
$$0 \in \partial F(f, t) \iff 0 \in (\gamma - u)\,\partial f(u) + f(u) - f(v) + (v - \gamma)\,\partial f(v) \tag{20}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ 0 = (\gamma - u)\alpha + f(u) - f(v) + (v - \gamma)\beta \tag{21}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = u\alpha - f(u) + f(v) - v\beta \tag{22}$$
$$\iff \exists\,\alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = f^*(\alpha) - f^*(\beta). \tag{23}$$
Now, (19) holds for any $t$, implying that for any $u, v, \gamma$, condition (23) holds for $f_1$ if and only if it also holds for $f_2$. If $\partial f_2(u) \cap \partial f_2(v) = \emptyset$, and $f^*(\alpha), f^*(\beta) < +\infty$ for both $f_1$ and $f_2$, then there exist $\alpha_1 \in \partial f_1(u)$, $\beta_1 \in \partial f_1(v)$, $\alpha_2 \in \partial f_2(u)$, $\beta_2 \in \partial f_2(v)$ such that
$$\frac{f_1^*(\alpha_1) - f_1^*(\beta_1)}{\alpha_1 - \beta_1} = \frac{f_2^*(\alpha_2) - f_2^*(\beta_2)}{\alpha_2 - \beta_2}.$$
If instead $\partial f_1(u) \cap \partial f_1(v) \neq \emptyset$, then, due to the monotonicity of subdifferentials, the function $f_1$ is linear on $[u, v]$ with some slope $s$. In that case (23) holds for $f_1$ and any $\gamma$ by choosing $\alpha = \beta = s$; this implies that (23) also holds for $f_2$ for any $\gamma$, and thus we deduce that $f_2$ is also linear on $[u, v]$.

Theorem 20. Suppose that $f_1$ and $f_2$ are convex functions on $[0, +\infty) \to \mathbb{R}$, differentiable almost everywhere, with $f_1 \stackrel{u}{\approx} f_2$. Then $f_1(u) = cf_2(u) + au + b$ for some constants $c > 0$ and $a, b$.

Proof. Let $v$ be a point at which both $f_1$ and $f_2$ are differentiable, and let $d_1 = f_1'(v)$, $d_2 = f_2'(v)$. Without loss of generality, assume that $f_1(v) = f_2(v) = 0$ (if not, we may consider instead the functions, of $u$, $f_1(u) - f_1(v)$ and $f_2(u) - f_2(v)$). Now, for any $u$ at which both $f_1$ and $f_2$ are differentiable, apply the above lemma to $v$ and $u$: either both $f_1$ and $f_2$ are linear on $[v, u]$ (or $[u, v]$ if $u < v$), in which case $f_1(u) = cf_2(u)$ for some constant $c$, or the following is true:
$$\frac{uf_1'(u) - f_1(u) - vd_1}{f_1'(u) - d_1} = \frac{uf_2'(u) - f_2(u) - vd_2}{f_2'(u) - d_2}.$$

In either case, we have
$$\big(uf_1'(u) - f_1(u) - vd_1\big)\big(f_2'(u) - d_2\big) = \big(uf_2'(u) - f_2(u) - vd_2\big)\big(f_1'(u) - d_1\big).$$
Let $f_1(u) = g_1(u) + d_1u$ and $f_2(u) = g_2(u) + d_2u$. Then
$$\big(ug_1'(u) - g_1(u) - vd_1\big)g_2'(u) = \big(ug_2'(u) - g_2(u) - vd_2\big)g_1'(u),$$
implying that $\big(g_1(u) + vd_1\big)g_2'(u) = \big(g_2(u) + vd_2\big)g_1'(u)$ for any $u$ at which $f_1$ and $f_2$ are both differentiable. It follows that $g_1(u) + vd_1 = c\big(g_2(u) + vd_2\big)$ for some constant $c$, and this constant must be the same for all $u$ due to the continuity of $f_1$ and $f_2$. Hence,
$$f_1(u) = g_1(u) + d_1u = cg_2(u) + d_1u + cvd_2 - vd_1 = cf_2(u) + (d_1 - cd_2)u + cvd_2 - vd_1.$$
It is now simple to check that, in order for $I_{f_1}$ and $I_{f_2}$ to have the same monotonicity, it is necessary and sufficient that $c > 0$.

Corollary 21. 1. All f-divergences (for continuous convex $f : [0, +\infty) \to \mathbb{R}$) that are universally equivalent to the variational distance must have the form
$$f(u) = -c\min(u, 1) + au + b, \quad \text{for } c > 0.$$
2. The 0-1 loss is universally equivalent to all and only those loss functions whose corresponding f-divergences are based on $f(u) = -c\min(u, 1) + au + b$ for $c > 0$.

Proof. Statement 2 follows immediately from statement 1. Note that the proof of Theorem 20 does not apply directly here, because it requires both $f_1$ and $f_2$ to be differentiable almost everywhere. Nonetheless, it is simple to show that statement 1 follows directly from Lemma 19. Indeed, the variational distance corresponds to $f_1(u) = |u - 1| = u + 1 - 2\min\{u, 1\}$, which is linear before 1 and linear after 1; the same must therefore be true of any continuous convex function $f_2$ universally equivalent to it. All such functions can indeed be written as $f(u) = -c\min(u, 1) + au + b$ for some constants $c, a, b$, and to have the same monotonicity as $f_1$ it is necessary and sufficient that $c > 0$.

The notion of universal equivalence is quite restrictive within the domain of f-divergences, as it requires two universally equivalent functions $f$ to be related by an additive linear term and a positive multiplicative constant. However, this restrictiveness does not carry over to the class of surrogate loss functions equivalent to the 0-1 loss: as we showed in the previous section, there exists a fairly large class of such surrogate loss functions.

5.4 Design of convex loss functions equivalent to 0-1 loss

In this section, we study in more detail the class of surrogate loss functions $\phi$ that are universally equivalent to the 0-1 loss. As in the classification literature in machine learning, the notion of a surrogate loss function is useful when we have access to $P(X, Y)$ only through empirical data. In this situation, the decision rule $(Q, \gamma)$ is optimized by minimizing an empirical version of the $\phi$-risk, $\hat{\mathbb{E}}\phi(Y\gamma(Z))$. In this setting, we do not have closed-form knowledge of $\mu(z)$ and $\pi(z)$, on which the optimal $\gamma(z)$ is based; in other words, we do not have a closed-form solution for $\gamma(z)$.

What are desirable properties of a surrogate loss function? There are computational properties that we would like, such as convexity and differentiability, as well as statistical properties (such as consistency). By restricting our attention to surrogate loss functions that are universally equivalent to the 0-1 loss, we shall be able to show (in the next section) that an algorithm that jointly minimizes an empirical version of the $\phi$-risk over $(Q, \gamma)$ is universally consistent, i.e., it achieves the minimum Bayes risk in the limit of infinite data. In this subsection, we shall prove that there does not exist a differentiable surrogate loss that is universally equivalent to the 0-1 loss.
Before proceeding to the proof, let us present several examples of


More information

Estimating divergence functionals and the likelihood ratio by convex risk minimization

Estimating divergence functionals and the likelihood ratio by convex risk minimization Estimating divergence functionals and the likelihood ratio by convex risk minimization XuanLong Nguyen Dept. of Statistical Science Duke University xuanlong.nguyen@stat.duke.edu Martin J. Wainwright Dept.

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Classification with Reject Option

Classification with Reject Option Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012 Outline. Introduction.. Classification with reject option. Spirit of the papers BW2008.. Infinite

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

Calibrated Surrogate Losses

Calibrated Surrogate Losses EECS 598: Statistical Learning Theory, Winter 2014 Topic 14 Calibrated Surrogate Losses Lecturer: Clayton Scott Scribe: Efrén Cruz Cortés Disclaimer: These notes have not been subjected to the usual scrutiny

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Learning in decentralized systems: A nonparametric approach. XuanLong Nguyen. Doctor of Philosophy

Learning in decentralized systems: A nonparametric approach. XuanLong Nguyen. Doctor of Philosophy Learning in decentralized systems: A nonparametric approach by XuanLong Nguyen B.S. (Pohang University of Science and Technology) 1999 M.S. (Arizona State University) 2001 M.A. (University of California,

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM

DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM Due: Monday, April 11, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to

More information

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework

More information

Decentralized Detection in Sensor Networks

Decentralized Detection in Sensor Networks IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 51, NO 2, FEBRUARY 2003 407 Decentralized Detection in Sensor Networks Jean-François Chamberland, Student Member, IEEE, and Venugopal V Veeravalli, Senior Member,

More information

DIVERGENCES (or pseudodistances) based on likelihood

DIVERGENCES (or pseudodistances) based on likelihood IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 56, NO 11, NOVEMBER 2010 5847 Estimating Divergence Functionals the Likelihood Ratio by Convex Risk Minimization XuanLong Nguyen, Martin J Wainwright, Michael

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

BAYESIAN DESIGN OF DECENTRALIZED HYPOTHESIS TESTING UNDER COMMUNICATION CONSTRAINTS. Alla Tarighati, and Joakim Jaldén

BAYESIAN DESIGN OF DECENTRALIZED HYPOTHESIS TESTING UNDER COMMUNICATION CONSTRAINTS. Alla Tarighati, and Joakim Jaldén 204 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) BAYESIA DESIG OF DECETRALIZED HYPOTHESIS TESTIG UDER COMMUICATIO COSTRAITS Alla Tarighati, and Joakim Jaldén ACCESS

More information

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection 2708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 11, NOVEMBER 2004 Exact Minimax Strategies for Predictive Density Estimation, Data Compression, Model Selection Feng Liang Andrew Barron, Senior

More information

Convexity, Detection, and Generalized f-divergences

Convexity, Detection, and Generalized f-divergences Convexity, Detection, and Generalized f-divergences Khashayar Khosravi Feng Ruan John Duchi June 5, 015 1 Introduction The goal of classification problem is to learn a discriminant function for classification

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

arxiv: v4 [cs.it] 17 Oct 2015

arxiv: v4 [cs.it] 17 Oct 2015 Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets Igal Sason Department of Electrical Engineering Technion Israel Institute of Technology

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago September 2015 Abstract Rothschild and Stiglitz (1970) introduce a

More information

Learning in decentralized systems: A nonparametric approach

Learning in decentralized systems: A nonparametric approach Learning in decentralized systems: A nonparametric approach Xuanlong Nguyen Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2007-111 http://www.eecs.berkeley.edu/pubs/techrpts/2007/eecs-2007-111.html

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

On Bayes Risk Lower Bounds

On Bayes Risk Lower Bounds Journal of Machine Learning Research 17 (2016) 1-58 Submitted 4/16; Revised 10/16; Published 12/16 On Bayes Risk Lower Bounds Xi Chen Stern School of Business New York University New York, NY 10012, USA

More information

Smart Predict, then Optimize

Smart Predict, then Optimize . Smart Predict, then Optimize arxiv:1710.08005v2 [math.oc] 14 Dec 2017 Adam N. Elmachtoub Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, adam@ieor.columbia.edu

More information

Stanford Statistics 311/Electrical Engineering 377

Stanford Statistics 311/Electrical Engineering 377 I. Bayes risk in classification problems a. Recall definition (1.2.3) of f-divergence between two distributions P and Q as ( ) p(x) D f (P Q) : q(x)f dx, q(x) where f : R + R is a convex function satisfying

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Subdifferential representation of convex functions: refinements and applications

Subdifferential representation of convex functions: refinements and applications Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential

More information

Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

On the Consistency of AUC Pairwise Optimization

On the Consistency of AUC Pairwise Optimization On the Consistency of AUC Pairwise Optimization Wei Gao and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University Collaborative Innovation Center of Novel Software Technology

More information

On deterministic reformulations of distributionally robust joint chance constrained optimization problems

On deterministic reformulations of distributionally robust joint chance constrained optimization problems On deterministic reformulations of distributionally robust joint chance constrained optimization problems Weijun Xie and Shabbir Ahmed School of Industrial & Systems Engineering Georgia Institute of Technology,

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Decentralized Detection In Wireless Sensor Networks

Decentralized Detection In Wireless Sensor Networks Decentralized Detection In Wireless Sensor Networks Milad Kharratzadeh Department of Electrical & Computer Engineering McGill University Montreal, Canada April 2011 Statistical Detection and Estimation

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Approximation Metrics for Discrete and Continuous Systems

Approximation Metrics for Discrete and Continuous Systems University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science May 2007 Approximation Metrics for Discrete Continuous Systems Antoine Girard University

More information

Global minimization. Chapter Upper and lower bounds

Global minimization. Chapter Upper and lower bounds Chapter 1 Global minimization The issues related to the behavior of global minimization problems along a sequence of functionals F are by now well understood, and mainly rely on the concept of -limit.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011

7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 7068 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Asymptotic Optimality Theory for Decentralized Sequential Multihypothesis Testing Problems Yan Wang Yajun Mei Abstract The Bayesian

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u.

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u. 5. Fields 5.1. Field extensions. Let F E be a subfield of the field E. We also describe this situation by saying that E is an extension field of F, and we write E/F to express this fact. If E/F is a field

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Convexity, Classification, and Risk Bounds

Convexity, Classification, and Risk Bounds Convexity, Classification, and Risk Bounds Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@stat.berkeley.edu Michael I. Jordan Computer

More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Estimation of signal information content for classification

Estimation of signal information content for classification Estimation of signal information content for classification The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Math Tune-Up Louisiana State University August, Lectures on Partial Differential Equations and Hilbert Space

Math Tune-Up Louisiana State University August, Lectures on Partial Differential Equations and Hilbert Space Math Tune-Up Louisiana State University August, 2008 Lectures on Partial Differential Equations and Hilbert Space 1. A linear partial differential equation of physics We begin by considering the simplest

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Notes on Ordered Sets

Notes on Ordered Sets Notes on Ordered Sets Mariusz Wodzicki September 10, 2013 1 Vocabulary 1.1 Definitions Definition 1.1 A binary relation on a set S is said to be a partial order if it is reflexive, x x, weakly antisymmetric,

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information