Stanford Statistics 311/Electrical Engineering 377

I. Bayes risk in classification problems

  a. Recall definition (1.2.3) of the $f$-divergence between two distributions $P$ and $Q$,
     \[
     D_f(P \| Q) := \int q(x) f\Big(\frac{p(x)}{q(x)}\Big)\, dx,
     \]
     where $f : \mathbb{R}_+ \to \mathbb{R}$ is a convex function satisfying $f(1) = 0$. If $f$ is not linear, then $D_f(P \| Q) > 0$ unless $P = Q$.

  b. Focusing on the binary classification case, let us consider some example risks and see what connections they have to $f$-divergences. (Recall that we have $X \in \mathcal{X}$ and a label $Y \in \{-1, 1\}$ that we would like to predict.)

    1. We require a few definitions to understand the performance of different classification strategies. In particular, we consider the difference between the risk attainable when we see a point to classify and when we do not.

    2. The prior risk is the risk attainable without seeing $x$: for a fixed sign $\alpha \in \mathbb{R}$ we define
       \[
       R_{\rm prior}(\alpha) := P(Y = 1)\, 1\{\alpha \leq 0\} + P(Y = -1)\, 1\{\alpha \geq 0\},
       \tag{11.1.1}
       \]
       and similarly the minimal prior risk
       \[
       R_{\rm prior} := \inf_{\alpha} \big\{ P(Y = 1)\, 1\{\alpha \leq 0\} + P(Y = -1)\, 1\{\alpha \geq 0\} \big\}
       = \min\{P(Y = 1), P(Y = -1)\}.
       \tag{11.1.2}
       \]

    3. We also have the prior $\phi$-risk, defined as
       \[
       R_{\phi,{\rm prior}}(\alpha) := P(Y = 1)\phi(\alpha) + P(Y = -1)\phi(-\alpha),
       \tag{11.1.3}
       \]
       and the minimal prior $\phi$-risk, defined as
       \[
       R_{\phi,{\rm prior}} := \inf_{\alpha} \big\{ P(Y = 1)\phi(\alpha) + P(Y = -1)\phi(-\alpha) \big\}.
       \tag{11.1.4}
       \]

  c. Examples with the 0-1 loss and its friends; throughout, $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$.

    1. Example 11.11 (Binary classification with 0-1 loss): What is the Bayes risk of a binary classifier? Let $p_1(x) := p(x \mid Y = 1)$ be the density of $X$ conditional on $Y = 1$, and similarly for $p_{-1}(x)$, so that $P(Y = 1 \mid X = x)\, p(x) = p_1(x) P(Y = 1)$, and assume that each class occurs with probability $1/2$. Then
       \begin{align*}
       R &= \inf_{\alpha} \int \big[ 1\{\alpha(x) \leq 0\}\, P(Y = 1 \mid X = x) + 1\{\alpha(x) \geq 0\}\, P(Y = -1 \mid X = x) \big] p(x)\, dx \\
         &= \frac{1}{2} \inf_{\alpha} \int \big[ 1\{\alpha(x) \leq 0\}\, p_1(x) + 1\{\alpha(x) \geq 0\}\, p_{-1}(x) \big]\, dx
         = \frac{1}{2} \int \min\{p_1(x), p_{-1}(x)\}\, dx.
       \end{align*}
       Similarly, we may compute the minimal prior risk, which is simply $1/2$ by definition (11.1.2). Looking at the gap between the two, we obtain
       \[
       R_{\rm prior} - R
       = \frac{1}{2}\Big[ 1 - \int \min\{p_1(x), p_{-1}(x)\}\, dx \Big]
       = \frac{1}{2} \left\|P_1 - P_{-1}\right\|_{\rm TV}.
       \]
       That is, the difference is half the variation distance between $P_1$ and $P_{-1}$, the distributions of $X$ conditional on the label $Y$.

    2. Example 11.12 (Binary classification with hinge loss): We now repeat precisely the same calculations as in Example 11.11, but using as our loss the hinge loss $\phi(\alpha) = [1 - \alpha]_+$ (recall Example 11.2). In this case, the minimal $\phi$-risk is
       \begin{align*}
       R_\phi &= \inf_{\alpha} \int \big[ [1 - \alpha(x)]_+ P(Y = 1 \mid X = x) + [1 + \alpha(x)]_+ P(Y = -1 \mid X = x) \big] p(x)\, dx \\
       &= \frac{1}{2} \inf_{\alpha} \int \big[ [1 - \alpha(x)]_+ p_1(x) + [1 + \alpha(x)]_+ p_{-1}(x) \big]\, dx
       = \int \min\{p_1(x), p_{-1}(x)\}\, dx.
       \end{align*}
       We can similarly compute the prior $\phi$-risk as $R_{\phi,{\rm prior}} = 1$. Now, when we calculate the improvement available via observing $X = x$, we find that
       \[
       R_{\phi,{\rm prior}} - R_\phi = 1 - \int \min\{p_1(x), p_{-1}(x)\}\, dx = \left\|P_1 - P_{-1}\right\|_{\rm TV},
       \]
       which is suggestively similar to Example 11.11.

  d. Is there anything more we can say about this?
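As a quick numerical check of Examples 11.11 and 11.12 (a sketch that is not part of the original notes, written in Python/numpy with hypothetical class-conditional pmfs chosen only for illustration), the two risk gaps match the variation distance as claimed:

```python
import numpy as np

# Hypothetical class-conditional pmfs p_1, p_{-1} on a 5-point space (illustration only).
p1  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
pm1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])

tv = 0.5 * np.abs(p1 - pm1).sum()          # ||P_1 - P_{-1}||_TV

# Example 11.11 (0-1 loss, equal priors): R = (1/2) sum_x min{p_1, p_{-1}}, R_prior = 1/2.
R = 0.5 * np.minimum(p1, pm1).sum()
assert np.isclose(0.5 - R, 0.5 * tv)       # gap equals half the variation distance

# Example 11.12 (hinge loss, equal priors): R_phi = sum_x min{p_1, p_{-1}}, R_{phi,prior} = 1.
R_phi = np.minimum(p1, pm1).sum()
assert np.isclose(1.0 - R_phi, tv)         # gap equals the variation distance
```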

II. Statistical information, $f$-divergences, and classification problems

  a. Statistical information

    1. Suppose we have a classification problem with data $X \in \mathcal{X}$ and labels $Y \in \{-1, 1\}$. A natural notion of the information that $X$ carries about $Y$ is the gap
       \[
       R_{\rm prior} - R,
       \tag{11.1.5}
       \]
       that is, the gap between the prior risk and the risk attainable after viewing $x \in \mathcal{X}$.

    2. Didn't present this. The true definition of statistical information: suppose class $1$ has prior probability $\pi$ and class $-1$ has prior probability $1 - \pi$, and let $P_1$ and $P_{-1}$ be the distributions of $X \in \mathcal{X}$ given $Y = 1$ and $Y = -1$, respectively. The Bayes risk associated with the problem is then
       \[
       B_\pi(P_1, P_{-1}) := \inf_{\alpha} \int \big[ 1\{\alpha(x) \leq 0\}\, p_1(x)\pi + 1\{\alpha(x) \geq 0\}\, p_{-1}(x)(1 - \pi) \big]\, dx
       = \int \min\{p_1(x)\pi,\; p_{-1}(x)(1 - \pi)\}\, dx,
       \tag{11.1.6}
       \]
       and similarly, the prior Bayes risk is
       \[
       B_\pi := \inf_{\alpha} \big\{ 1\{\alpha \leq 0\}\pi + 1\{\alpha \geq 0\}(1 - \pi) \big\} = \min\{\pi, 1 - \pi\}.
       \tag{11.1.7}
       \]
       The statistical information is then
       \[
       B_\pi - B_\pi(P_1, P_{-1}).
       \tag{11.1.8}
       \]

    3. This measure was proposed by DeGroot [1] in an experimental design problem: the goal is to infer the state of the world based on further experiments, and we want to measure the quality of a measurement.

    4. We saw that for the 0-1 loss, when a priori each class is equally likely, $R_{\rm prior} - R = \frac{1}{2}\|P_1 - P_{-1}\|_{\rm TV}$, and similarly for the hinge loss (Example 11.12) that $R_{\phi,{\rm prior}} - R_\phi = \|P_1 - P_{-1}\|_{\rm TV}$.

    5. Note that if $P_1 \neq P_{-1}$, then the statistical information is positive.
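A companion sketch for definitions (11.1.6)-(11.1.8) (again not part of the original notes; the pmfs and the prior $\pi$ are hypothetical values for illustration) computes the statistical information directly:

```python
import numpy as np

# Hypothetical class-conditional pmfs and class prior pi = P(Y = 1) (illustration only).
p1  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
pm1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
pi  = 0.3

# Bayes risk (11.1.6): integrate (here, sum) min{pi * p_1(x), (1 - pi) * p_{-1}(x)}.
B_post = np.minimum(pi * p1, (1 - pi) * pm1).sum()

# Prior Bayes risk (11.1.7): min{pi, 1 - pi}.
B_prior = min(pi, 1 - pi)

# Statistical information (11.1.8): nonnegative, and positive here since P_1 != P_{-1}.
info = B_prior - B_post
print(info)            # 0.105 for these hypothetical values
assert info > 0
```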

  b. Did present this. Is there a more general story? Yes.

    1. Consider any margin-based surrogate loss $\phi$, and look at the difference between
       \[
       B_{\phi,\pi}(P_1, P_{-1}) := \inf_{\alpha} \int \big[ \phi(\alpha(x))\, p_1(x)\pi + \phi(-\alpha(x))\, p_{-1}(x)(1 - \pi) \big]\, dx
       = \int \inf_{\alpha} \big[ \phi(\alpha)\, p_1(x)\pi + \phi(-\alpha)\, p_{-1}(x)(1 - \pi) \big]\, dx
       \]
       and the prior $\phi$-risk $B_{\phi,\pi} := \inf_{\alpha} \{\pi\phi(\alpha) + (1 - \pi)\phi(-\alpha)\}$.

    2. Note that $B_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1})$ is simply the gap in $\phi$-risk $R_{\phi,{\rm prior}} - R_\phi$ for the distribution with $P(Y = 1) = \pi$ and
       \[
       P(Y = y \mid X = x) = \frac{p(x \mid Y = y)\, P(Y = y)}{p(x)}
       = \frac{p_y(x)\, \pi^{1\{y = 1\}} (1 - \pi)^{1\{y = -1\}}}{\pi p_1(x) + (1 - \pi) p_{-1}(x)}.
       \tag{11.1.9}
       \]

  c. We have the following theorem (see, for example, Liese and Vajda [2] or Reid and Williamson [4]).

     Theorem 11.13. Let $P_1$ and $P_{-1}$ be arbitrary distributions on $\mathcal{X}$, and let $\pi \in [0, 1]$ be the prior probability of class $1$. Then there is a convex function $f_{\pi,\phi} : \mathbb{R}_+ \to \mathbb{R}$ satisfying $f_{\pi,\phi}(1) = 0$ such that
     \[
     B_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1}) = D_{f_{\pi,\phi}}(P_1 \| P_{-1}).
     \]
     Moreover, this function $f_{\pi,\phi}$ is
     \[
     f_{\pi,\phi}(t) = \sup_{\alpha} \big\{ l_\phi(\pi)\,(t\pi + (1 - \pi)) - \pi\phi(\alpha)\, t - (1 - \pi)\phi(-\alpha) \big\},
     \]
     where $l_\phi(\eta) := \inf_{\alpha}\{\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha)\}$ denotes the minimal conditional $\phi$-risk, so that $B_{\phi,\pi} = l_\phi(\pi)$.

     Proof.  First, consider the integrated Bayes risk. Recalling the definition of the conditional probability $\eta(x) := P(Y = 1 \mid X = x)$, we have
     \begin{align*}
     B_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1})
     &= \int \big[ l_\phi(\pi) - l_\phi(\eta(x)) \big] p(x)\, dx \\
     &= \int \sup_{\alpha} \big[ l_\phi(\pi) - \phi(\alpha)\, P(Y = 1 \mid x) - \phi(-\alpha)\, P(Y = -1 \mid x) \big] p(x)\, dx \\
     &= \int \sup_{\alpha} \Big[ l_\phi(\pi) - \phi(\alpha) \frac{p_1(x)\pi}{p(x)} - \phi(-\alpha) \frac{p_{-1}(x)(1 - \pi)}{p(x)} \Big] p(x)\, dx,
     \end{align*}
     where we have used Bayes' rule as in (11.1.9). Let us now divide all appearances of the density $p_1$ by $p_{-1}$, using $p(x) = \pi p_1(x) + (1 - \pi) p_{-1}(x)$, which yields
     \[
     B_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1})
     = \int \sup_{\alpha} \Bigg[ l_\phi(\pi) - \frac{\phi(\alpha) \frac{p_1(x)}{p_{-1}(x)} \pi + \phi(-\alpha)(1 - \pi)}{\frac{p_1(x)}{p_{-1}(x)} \pi + (1 - \pi)} \Bigg]
       \Big( \frac{p_1(x)}{p_{-1}(x)} \pi + (1 - \pi) \Big) p_{-1}(x)\, dx.
     \tag{$\star$}
     \]
     By inspection, the representation ($\star$) gives the result of the theorem if we can argue that the function $f_\pi := f_{\pi,\phi}$ is convex, where we substitute $p_1(x)/p_{-1}(x)$ for $t$ in $f_\pi(t)$. To see that the function $f_\pi$ is convex, consider the intermediate function
     \[
     s(u) := \sup_{\alpha} \big\{ -u\phi(\alpha) - (1 - u)\phi(-\alpha) \big\} = -l_\phi(u).
     \]
     This is the supremum of a family of functions linear in the variable $u$, so it is convex. Moreover, as we noted in the first exercise set, the perspective of a convex function $g$, defined by $h(u, t) := t\, g(u/t)$ for $t > 0$, is jointly convex in $(u, t)$. Thus, as
     \[
     f_\pi(t) = l_\phi(\pi)\,(\pi t + (1 - \pi)) + (\pi t + (1 - \pi))\, s\Big( \frac{\pi t}{\pi t + (1 - \pi)} \Big)
     \]
     and the map $t \mapsto (\pi t, \pi t + (1 - \pi))$ is affine, we have that $f_\pi$ is convex. It is clear that $f_\pi(1) = 0$ by the definition of $l_\phi(\pi)$.

  d. Take-home message: any loss function induces an associated $f$-divergence. (There is a complete converse, in that any $f$-divergence can be realized as the difference between the prior and posterior Bayes risks for some loss function; see, for example, Liese and Vajda [2] for results of this type.)
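To make Theorem 11.13 concrete, the following sketch (not from the notes; the distributions and prior are hypothetical illustration values) builds $f_{\pi,\phi}$ for the hinge loss by maximizing the supremum formula over a grid of margins, then verifies the claimed identity on a discrete space. For the hinge loss every optimization involved is attained at a margin in $\{-1, 0, 1\}$, so a grid on $[-1, 1]$ containing those points is exact.

```python
import numpy as np

phi = lambda a: np.maximum(1.0 - a, 0.0)   # hinge loss as the margin-based surrogate

# Hypothetical class-conditional pmfs and prior (same illustration as above).
p1  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
pm1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
pi  = 0.3

# Grid over margins; for the hinge loss all optima below are attained on [-1, 1].
alphas = np.linspace(-1.0, 1.0, 2001)

# Minimal prior phi-risk l_phi(pi) = inf_alpha {pi phi(alpha) + (1-pi) phi(-alpha)} = B_{phi,pi}.
l_pi = np.min(pi * phi(alphas) + (1 - pi) * phi(-alphas))

def f_pi_phi(t):
    # Theorem 11.13: sup_alpha { l_phi(pi)(pi t + 1 - pi) - pi phi(alpha) t - (1-pi) phi(-alpha) }.
    return np.max(l_pi * (pi * t + 1 - pi) - pi * phi(alphas) * t - (1 - pi) * phi(-alphas))

# Left side: the f-divergence D_{f_{pi,phi}}(P_1 || P_{-1}) on the discrete space.
lhs = sum(b * f_pi_phi(a / b) for a, b in zip(p1, pm1))

# Right side: gap between the prior phi-risk and the integrated Bayes phi-risk.
B_post = sum(np.min(phi(alphas) * pi * a + phi(-alphas) * (1 - pi) * b) for a, b in zip(p1, pm1))
rhs = l_pi - B_post

print(lhs, rhs)        # both 0.21 for these hypothetical values
assert np.isclose(lhs, rhs)
```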

III. Quantization and other types of empirical minimization

  a. Do these equivalences mean anything? What about the fact that the suboptimality function $H_\phi$ was linear for the hinge loss?

  b. Consider problems with quantization: we must jointly learn a classifier (a prediction or discriminant function) and a quantizer $q : \mathcal{X} \to \{1, \ldots, k\}$, where $k$ is fixed and we wish to find an optimal quantizer $q \in \mathcal{Q}$, where $\mathcal{Q}$ is some family of quantizers. Recall the notation (1.2.1) for the quantized $f$-divergence,
     \[
     D_f(P_0 \| P_1 \mid q) = \sum_{i=1}^{k} P_1(q^{-1}(i))\, f\Big( \frac{P_0(q^{-1}(i))}{P_1(q^{-1}(i))} \Big)
     = \sum_{i=1}^{k} P_1(A_i)\, f\Big( \frac{P_0(A_i)}{P_1(A_i)} \Big),
     \]
     where the $A_i$ are the quantization regions of $\mathcal{X}$.

  c. Using Theorem 11.13, we can show how quantization and learning can be unified.

    1. Quantized version of the risk: for $q : \mathcal{X} \to \{1, \ldots, k\}$ and $\alpha : [k] \to \mathbb{R}$,
       \[
       R_\phi(\alpha \circ q) = \mathbb{E}\big[\phi(Y \alpha(q(X)))\big].
       \]

    2. Rearranging and using iterated expectation,
       \begin{align*}
       R_\phi(\alpha \circ q) &= \mathbb{E}\big[\phi(Y\alpha(q(X)))\big]
       = \sum_{z=1}^{k} \mathbb{E}\big[\phi(Y\alpha(z)) \mid q(X) = z\big]\, P(q(X) = z) \\
       &= \sum_{z=1}^{k} \big[ \phi(\alpha(z))\, P(Y = 1 \mid q(X) = z) + \phi(-\alpha(z))\, P(Y = -1 \mid q(X) = z) \big] P(q(X) = z) \\
       &= \sum_{z=1}^{k} \Big[ \phi(\alpha(z)) \frac{P(q(X) = z \mid Y = 1) P(Y = 1)}{P(q(X) = z)} + \phi(-\alpha(z)) \frac{P(q(X) = z \mid Y = -1) P(Y = -1)}{P(q(X) = z)} \Big] P(q(X) = z) \\
       &= \sum_{z=1}^{k} \big[ \phi(\alpha(z))\, P_1(q(X) = z)\pi + \phi(-\alpha(z))\, P_{-1}(q(X) = z)(1 - \pi) \big].
       \end{align*}

    3. Let $P^q$ denote the distribution with probability mass function
       \[
       P^q(z) = P(q(X) = z) = P(q^{-1}(\{z\})),
       \]
       with $P_1^q$ and $P_{-1}^q$ defined analogously from the conditional distributions $P_1$ and $P_{-1}$, and define the quantized Bayes $\phi$-risk
       \[
       R_\phi^*(q) := \inf_{\alpha} R_\phi(\alpha \circ q).
       \]
       Then for a problem with $P(Y = 1) = \pi$, we have
       \[
       R_{\phi,{\rm prior}} - R_\phi^*(q)
       = B_{\phi,\pi} - B_{\phi,\pi}(P_1^q, P_{-1}^q)
       = D_{f_{\pi,\phi}}(P_1 \| P_{-1} \mid q).
       \tag{$\star\star$}
       \]
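Continuing the same hypothetical example (again a sketch, not part of the notes), one can check the identity ($\star\star$): quantize the five-point space, compute the optimal quantized $\phi$-risk cell by cell, and compare the resulting risk gap with the quantized $f$-divergence.

```python
import numpy as np

phi = lambda a: np.maximum(1.0 - a, 0.0)   # hinge loss again
alphas = np.linspace(-1.0, 1.0, 2001)

p1  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
pm1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
pi  = 0.3

# A hypothetical quantizer q mapping the 5 points into k = 3 cells (indices 0, 1, 2).
cells = np.array([0, 0, 1, 2, 2])
k = 3

# Quantized conditional distributions P_1^q and P_{-1}^q over the k cells.
P1q  = np.array([p1[cells == z].sum() for z in range(k)])
Pm1q = np.array([pm1[cells == z].sum() for z in range(k)])

# Quantized Bayes phi-risk R*_phi(q): pick the best margin alpha(z) separately in each cell.
R_star = sum(np.min(phi(alphas) * pi * a + phi(-alphas) * (1 - pi) * b) for a, b in zip(P1q, Pm1q))

# Prior phi-risk and the quantized f-divergence D_{f_{pi,phi}}(P_1 || P_{-1} | q).
l_pi = np.min(pi * phi(alphas) + (1 - pi) * phi(-alphas))
f_pi_phi = lambda t: np.max(l_pi * (pi * t + 1 - pi) - pi * phi(alphas) * t - (1 - pi) * phi(-alphas))
D_q = sum(b * f_pi_phi(a / b) for a, b in zip(P1q, Pm1q))

# Identity (star-star): R_{phi,prior} - R*_phi(q) = D_{f_{pi,phi}}(P_1 || P_{-1} | q).
print(l_pi - R_star, D_q)                  # both 0.21 for this particular quantizer
assert np.isclose(l_pi - R_star, D_q)
```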

  d. A result unifying quantization and learning: we say that loss functions $\phi_1$ and $\phi_2$ are universally equivalent if they induce the same $f$-divergence in Theorem 11.13 up to an affine correction, that is, if there exist a constant $c > 0$ and $a, b \in \mathbb{R}$ such that
     \[
     f_{\pi,\phi_1}(t) = c f_{\pi,\phi_2}(t) + at + b \quad \text{for all } t.
     \tag{$\star\star\star$}
     \]

     Theorem 11.14. Let $\phi_1$ and $\phi_2$ be universally equivalent margin-based surrogate loss functions. Then for any quantizers $q_1$ and $q_2$,
     \[
     R_{\phi_1}^*(q_1) \leq R_{\phi_1}^*(q_2)
     \quad \text{if and only if} \quad
     R_{\phi_2}^*(q_1) \leq R_{\phi_2}^*(q_2).
     \]

     Proof.  The proof follows straightforwardly from the representation ($\star\star$). If $\phi_1$ and $\phi_2$ are universally equivalent, then for any quantizer $q$ we have
     \[
     R_{\phi_1,{\rm prior}} - R_{\phi_1}^*(q)
     = D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} \mid q)
     = c\, D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} \mid q) + a + b
     = c\big[ R_{\phi_2,{\rm prior}} - R_{\phi_2}^*(q) \big] + a + b,
     \]
     since the linear terms in ($\star\star\star$) contribute $a \sum_z P_1^q(z) + b \sum_z P_{-1}^q(z) = a + b$. In particular,
     \begin{align*}
     R_{\phi_1}^*(q_1) \leq R_{\phi_1}^*(q_2)
     &\iff R_{\phi_1,{\rm prior}} - R_{\phi_1}^*(q_1) \geq R_{\phi_1,{\rm prior}} - R_{\phi_1}^*(q_2) \\
     &\iff D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} \mid q_1) \geq D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} \mid q_2) \\
     &\iff D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} \mid q_1) \geq D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} \mid q_2) \\
     &\iff R_{\phi_2,{\rm prior}} - R_{\phi_2}^*(q_1) \geq R_{\phi_2,{\rm prior}} - R_{\phi_2}^*(q_2),
     \end{align*}
     where the third equivalence uses $c > 0$. Subtracting $R_{\phi_2,{\rm prior}}$ from both sides and negating gives the desired result.

  e. Some comments:

    1. We have observed something interesting: if we wish to learn a quantizer and a classifier jointly, this is possible using any loss universally equivalent to the true loss we care about.

    2. Example: the hinge loss and the 0-1 loss are universally equivalent.

    3. It turns out that the condition that the losses $\phi_1$ and $\phi_2$ be universally equivalent is (essentially) necessary and sufficient for the two losses to induce the same ordering over quantizers [3]; that is, this equivalence is necessary and sufficient for the ordering conclusion of Theorem 11.14.
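Finally, a sketch illustrating Theorem 11.14 with the pair from comment 2 (hinge and 0-1 loss); the two quantizers below are hypothetical, as are the distributions, and the point is only that both losses order the quantizers the same way.

```python
import numpy as np

p1  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
pm1 = np.array([0.05, 0.10, 0.15, 0.30, 0.40])
pi  = 0.3

hinge    = lambda a: np.maximum(1.0 - a, 0.0)
zero_one = lambda a: (a <= 0).astype(float)      # 0-1 loss of the margin

def quantized_risk(cells, k, loss):
    """Optimal quantized risk R*_loss(q) for the quantizer encoded by the cell assignments."""
    alphas = np.linspace(-1.0, 1.0, 2001)
    total = 0.0
    for z in range(k):
        a, b = p1[cells == z].sum(), pm1[cells == z].sum()
        total += np.min(loss(alphas) * pi * a + loss(-alphas) * (1 - pi) * b)
    return total

# Two hypothetical quantizers of the 5-point space into k = 2 cells.
q1 = np.array([0, 0, 0, 1, 1])   # splits the space roughly where the likelihood ratio crosses 1
q2 = np.array([0, 1, 0, 1, 0])   # mixes high- and low-ratio points within cells

for name, loss in [("0-1", zero_one), ("hinge", hinge)]:
    r1, r2 = quantized_risk(q1, 2, loss), quantized_risk(q2, 2, loss)
    print(f"{name:>5}: R*(q1) = {r1:.4f}, R*(q2) = {r2:.4f}, prefers q1: {r1 <= r2}")
```

Both losses prefer $q_1$ here; the numerical risk values differ between the two losses, but their ordering over quantizers agrees, as the theorem predicts.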

Bibliography

[1] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.

[2] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.

[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On surrogate loss functions and f-divergences. Annals of Statistics, 37(2):876-904, 2009.

[4] M. Reid and R. Williamson. Information, divergence, and risk for binary experiments. Journal of Machine Learning Research, 12:731-817, 2011.
