Surrogate Risk Consistency: the Classification Case
Chapter 11

I. The setting: supervised prediction problem
   (a) Have data coming in pairs $(X, Y)$ and a loss $L : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ (more general losses are possible)
   (b) Often it is hard to minimize $L$ directly (for example, if $L$ is non-convex), so we use a surrogate $\varphi$
   (c) We would like to compare the risks of functions $f : \mathcal{X} \to \mathbb{R}$:
         $R_\varphi(f) := \mathbb{E}[\varphi(f(X), Y)]$ and $R(f) := \mathbb{E}[L(f(X), Y)]$.
       In particular, when does minimizing the surrogate risk also minimize the true risk?
   (d) Toward this goal, we define the Bayes risks $R_\varphi^* := \inf_f R_\varphi(f)$ and $R^* := \inf_f R(f)$

   Definition 11.1 (Fisher consistency). We say the loss $\varphi$ is Fisher consistent if for any sequence of functions $f_n$,
         $R_\varphi(f_n) \to R_\varphi^*$ implies $R(f_n) \to R^*$.

II. Classification case
   (a) We focus on the binary classification case, so that $Y \in \{-1, 1\}$
      1. Margin-based losses: we wish to predict the sign correctly, so for $\alpha \in \mathbb{R}$,
            $L(\alpha, y) = 1\{\alpha y \le 0\}$ and $\varphi(\alpha, y) = \varphi(y\alpha)$.
      2. Consider conditional versions of the risks. Let $\eta(x) = P(Y = 1 \mid X = x)$ be the conditional probability; then
            $R(f) = \mathbb{E}[1\{f(X) Y \le 0\}] = P(\operatorname{sign}(f(X)) \ne Y) = \mathbb{E}[\eta(X) 1\{f(X) \le 0\} + (1 - \eta(X)) 1\{f(X) \ge 0\}] = \mathbb{E}[\ell(f(X), \eta(X))]$
         and
            $R_\varphi(f) = \mathbb{E}[\varphi(Y f(X))] = \mathbb{E}[\eta(X) \varphi(f(X)) + (1 - \eta(X)) \varphi(-f(X))] = \mathbb{E}[\ell_\varphi(f(X), \eta(X))]$,
         where we have defined the conditional risks
            $\ell(\alpha, \eta) = \eta 1\{\alpha \le 0\} + (1 - \eta) 1\{\alpha \ge 0\}$ and $\ell_\varphi(\alpha, \eta) = \eta \varphi(\alpha) + (1 - \eta) \varphi(-\alpha)$.
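Since the conditional risks are functions of a scalar margin $\alpha$, they are easy to explore numerically. Below is a small sketch (our own code, not from the notes; names like `cond_01_risk` are ours) that minimizes the conditional 0-1 risk over a grid and confirms that the optimal margin has the sign of $\eta - 1/2$:

```python
def cond_01_risk(alpha, eta):
    # l(alpha, eta) = eta * 1{alpha <= 0} + (1 - eta) * 1{alpha >= 0}
    return eta * (alpha <= 0) + (1 - eta) * (alpha >= 0)

def cond_surrogate_risk(alpha, eta, phi):
    # l_phi(alpha, eta) = eta * phi(alpha) + (1 - eta) * phi(-alpha)
    return eta * phi(alpha) + (1 - eta) * phi(-alpha)

eta = 0.8                                  # P(Y = 1 | X = x)
grid = [i / 100 - 5 for i in range(1001)]  # margins alpha in [-5, 5]

best_alpha = min(grid, key=lambda a: cond_01_risk(a, eta))
bayes_risk = min(cond_01_risk(a, eta) for a in grid)

# alpha*(eta) = sign(eta - 1/2): any positive margin is optimal here,
# and the minimal conditional 0-1 risk is min{eta, 1 - eta}
assert best_alpha > 0
assert abs(bayes_risk - min(eta, 1 - eta)) < 1e-12
```

Note that the 0-1 conditional risk is piecewise constant in $\alpha$, so the grid minimization is exact here; only the sign of the minimizer matters.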
      3. Note the minimizer of $\ell$: we have $\alpha^*(\eta) = \operatorname{sign}(\eta - 1/2)$, and $f^*(x) = \operatorname{sign}(\eta(x) - 1/2)$ minimizes the risk $R(f)$ over all $f$
      4. Minimizing over $f$ can be achieved pointwise, and we have
            $R^* = \mathbb{E}[\inf_\alpha \ell(\alpha, \eta(X))]$ and $R_\varphi^* = \mathbb{E}[\inf_\alpha \ell_\varphi(\alpha, \eta(X))]$.
   (b) Example 11.1 (Exponential loss): Consider the exponential loss, used in AdaBoost (among other settings), which sets $\varphi(\alpha) = e^{-\alpha}$. In this case, we have
            $\operatorname{argmin}_\alpha \ell_\varphi(\alpha, \eta) = \frac{1}{2} \log \frac{\eta}{1 - \eta}$
      because $\frac{\partial}{\partial \alpha} \ell_\varphi(\alpha, \eta) = -\eta e^{-\alpha} + (1 - \eta) e^{\alpha}$. Thus $f^*(x) = \frac{1}{2} \log \frac{\eta(x)}{1 - \eta(x)}$, and this loss is Fisher consistent.
   (c) Classification calibration
      1. Consider pointwise versions of the risk (all that is necessary, it turns out)
      2. Define the infimal conditional $\varphi$-risks as
            $\ell_\varphi^*(\eta) := \inf_\alpha \ell_\varphi(\alpha, \eta)$ and $\ell_\varphi^{\mathrm{wrong}}(\eta) := \inf_{\alpha(\eta - 1/2) \le 0} \ell_\varphi(\alpha, \eta)$.
      3. Intuition: if we always have $\ell_\varphi^*(\eta) < \ell_\varphi^{\mathrm{wrong}}(\eta)$ for all $\eta \ne 1/2$, we should do fine
      4. Define the sub-optimality function $H : [0, 1] \to \mathbb{R}$ by
            $H(\delta) := \ell_\varphi^{\mathrm{wrong}}\Big(\frac{1+\delta}{2}\Big) - \ell_\varphi^*\Big(\frac{1+\delta}{2}\Big)$.

      Definition. The margin-based loss $\varphi$ is classification calibrated if $H(\delta) > 0$ for all $\delta > 0$. Equivalently, for any $\eta \ne \frac{1}{2}$, we have $\ell_\varphi^*(\eta) < \ell_\varphi^{\mathrm{wrong}}(\eta)$.

      5. Example (Example 11.1 continued): For the exponential loss, we have
            $\ell_\varphi^{\mathrm{wrong}}(\eta) = \inf_{\alpha(2\eta - 1) \le 0} \{\eta e^{-\alpha} + (1 - \eta) e^{\alpha}\} = e^0 = 1$,
         while the unconstrained minimal conditional risk is
            $\ell_\varphi^*(\eta) = \eta \sqrt{\frac{1 - \eta}{\eta}} + (1 - \eta) \sqrt{\frac{\eta}{1 - \eta}} = 2\sqrt{\eta(1 - \eta)}$,
         so that $H(\delta) = 1 - \sqrt{1 - \delta^2}$.

      Example 11.2 (Hinge loss): We can also consider the hinge loss, which is defined as $\varphi(\alpha) = [1 - \alpha]_+$. We first compute the minimizers of the conditional risk; we have $\ell_\varphi(\alpha, \eta) = \eta [1 - \alpha]_+ + (1 - \eta)[1 + \alpha]_+$, whose unique minimizer (for $\eta \notin \{0, \frac{1}{2}, 1\}$) is $\alpha(\eta) = \operatorname{sign}(2\eta - 1)$. We thus have $\ell_\varphi^*(\eta) = 2\min\{\eta, 1 - \eta\}$ and $\ell_\varphi^{\mathrm{wrong}}(\eta) = \eta + (1 - \eta) = 1$. We obtain
            $H(\delta) = 1 - \min\{1 + \delta, 1 - \delta\} = \delta$.
      Comparing to the sub-optimality function for the exponential loss, that of the hinge loss is larger, so the resulting bound is tighter.
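The closed forms for $H$ above can be sanity-checked by brute-force minimization over a grid of margins. A small numerical sketch (our own code, using nothing beyond the definitions above; for $\eta \ge 1/2$ the wrong-sign margins are $\alpha \le 0$):

```python
import math

def cond_risk(alpha, eta, phi):
    # l_phi(alpha, eta) = eta * phi(alpha) + (1 - eta) * phi(-alpha)
    return eta * phi(alpha) + (1 - eta) * phi(-alpha)

def H(delta, phi, grid):
    # H(delta) = l_wrong((1 + delta)/2) - l_star((1 + delta)/2)
    eta = (1 + delta) / 2
    l_wrong = min(cond_risk(a, eta, phi) for a in grid if a <= 0)
    l_star = min(cond_risk(a, eta, phi) for a in grid)
    return l_wrong - l_star

grid = [i / 1000 - 5 for i in range(10001)]  # alpha in [-5, 5], step 0.001
exp_loss = lambda a: math.exp(-a)            # phi(alpha) = e^{-alpha}
hinge = lambda a: max(1.0 - a, 0.0)          # phi(alpha) = [1 - alpha]_+

delta = 0.6
# exponential loss: H(delta) = 1 - sqrt(1 - delta^2); hinge: H(delta) = delta
assert abs(H(delta, exp_loss, grid) - (1 - math.sqrt(1 - delta**2))) < 1e-4
assert abs(H(delta, hinge, grid) - delta) < 1e-4
```

The tolerance absorbs the grid discretization error; at $\delta = 0.6$ the two values are $0.2$ (exponential) and $0.6$ (hinge), consistent with the hinge loss having the larger sub-optimality function.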
      6. Pictures: use the exponential loss, with $\eta$ and without.
   (d) Our goal: using classification calibration, find some function $\psi$ such that
         $\psi(R(f) - R^*) \le R_\varphi(f) - R_\varphi^*$,
      where $\psi(\delta) > 0$ for all $\delta > 0$. Can we get a convex version of $H$, then maybe use Jensen's inequality to get the result? It turns out we will be able to do this.

III. Some necessary asides on convex analysis
   (a) Epigraphs and closures
      1. For a function $f$, the epigraph $\operatorname{epi} f$ is the set of points $(x, t)$ such that $f(x) \le t$
      2. A function $f$ is said to be closed if its epigraph is closed, which for convex $f$ occurs if and only if $f$ is lower semicontinuous (meaning $\liminf_{x \to x_0} f(x) \ge f(x_0)$)
      3. Note: a one-dimensional closed convex function is continuous

      Lemma 11.3. Let $f : \mathbb{R} \to \mathbb{R}$ be convex. Then $f$ is continuous on the interior of its domain.

      (Proof in notes; just give a picture)

      Lemma 11.4. Let $f : \mathbb{R} \to \mathbb{R}$ be closed convex. Then $f$ is continuous on its domain.

      4. The closure of a function $f$ is the function $\operatorname{cl} f$ whose epigraph is the closed convex hull of $\operatorname{epi} f$ (picture)
   (b) Conjugate functions (Fenchel–Legendre transform)
      1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an (arbitrary) function. Its conjugate (or Fenchel–Legendre conjugate) is defined to be
            $f^*(s) := \sup_t \{\langle t, s \rangle - f(t)\}$.
         (Picture here.) Note that we always have $f^*(s) + f(t) \ge \langle s, t \rangle$, or $f(t) \ge \langle s, t \rangle - f^*(s)$
      2. The Fenchel biconjugate is defined to be $f^{**}(t) = \sup_s \{\langle t, s \rangle - f^*(s)\}$
         (Picture here, noting that $\nabla f(t) = s$ implies $f^*(s) = \langle t, s \rangle - f(t)$)
      3. In fact, the biconjugate is the largest closed convex function smaller than $f$:

      Lemma 11.5. We have
            $f^{**}(x) = \sup_{a \in \mathbb{R}^d, b \in \mathbb{R}} \{\langle a, x \rangle - b : \langle a, t \rangle - b \le f(t) \text{ for all } t\}$.

      Proof   Let $A \subset \mathbb{R}^d \times \mathbb{R}$ denote all the pairs $(a, b)$ minorizing $f$, that is, those pairs such that $f(t) \ge \langle a, t \rangle - b$ for all $t$. Then we have
            $(a, b) \in A \iff f(t) \ge \langle a, t \rangle - b$ for all $t$ $\iff b \ge \langle a, t \rangle - f(t)$ for all $t$ $\iff b \ge f^*(a)$ and $a \in \operatorname{dom} f^*$.
      Thus we obtain the following sequence of equalities:
            $\sup_{(a, b) \in A} \{\langle a, x \rangle - b\} = \sup\{\langle a, x \rangle - b : a \in \operatorname{dom} f^*,\ b \ge f^*(a)\} = \sup_a \{\langle a, x \rangle - f^*(a)\} = f^{**}(x)$.
      So we have taken the supremum over all the supporting hyperplanes to the graph of $f$, as desired.
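Lemma 11.5's characterization of $f^{**}$ as the greatest closed convex minorant can be illustrated with a crude grid computation (our own sketch; the grids and the double-well example are arbitrary choices, and truncating the grids only approximates the true conjugates):

```python
ts = [i / 100 - 2 for i in range(401)]  # primal grid, t in [-2, 2]
ss = list(ts)                           # dual grid, s in [-2, 2]

def conj(vals, grid, s):
    # f*(s) = sup_t { s*t - f(t) }, restricted to the grid
    return max(s * t - v for t, v in zip(grid, vals))

# a non-convex "double well": f(t) = min((t+1)^2, (t-1)^2)
f = [min((t + 1) ** 2, (t - 1) ** 2) for t in ts]
fstar = [conj(f, ts, s) for s in ss]
fss = [conj(fstar, ss, t) for t in ts]  # biconjugate = convex envelope

# the envelope fills in the region between the wells: f(0) = 1 but f**(0) = 0
i0 = ts.index(0.0)
assert f[i0] == 1.0
assert abs(fss[i0]) < 1e-9
```

Between the wells at $t = \pm 1$, the largest convex minorant is identically zero, which is exactly what the discrete biconjugate recovers.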
      4. One other interesting lemma:

      Lemma 11.6. Let $h$ be either (i) continuous on $[0, 1]$ or (ii) non-decreasing on $[0, 1]$. (And set $h(1 + \delta) = +\infty$ for $\delta > 0$.) If $h$ satisfies $h(t) > 0$ for $t > 0$ and $h(0) = 0$, then $f = h^{**}$ satisfies $f(t) > 0$ for any $t > 0$.

      (Proof by picture)

IV. Classification calibration results
   (a) Getting quantitative bounds on risk: define the $\psi$-transform via
         $\psi(\delta) := H^{**}(\delta)$.   (11.0.1)
   (b) Main theorem for today:

      Theorem 11.7. Let $\varphi$ be a margin-based loss function and $\psi$ the associated $\psi$-transform. Then for any $f : \mathcal{X} \to \mathbb{R}$,
            $\psi(R(f) - R^*) \le R_\varphi(f) - R_\varphi^*$.   (11.0.2)
      Moreover, the following three statements are equivalent:
      1. The loss $\varphi$ is classification-calibrated
      2. For any sequence $\delta_n \in [0, 1]$, $\psi(\delta_n) \to 0$ implies $\delta_n \to 0$
      3. For any sequence of measurable functions $f_n : \mathcal{X} \to \mathbb{R}$, $R_\varphi(f_n) \to R_\varphi^*$ implies $R(f_n) \to R^*$.

      1. Some insights from the theorem. Recall Examples 11.1 and 11.2. For both of these, we have $\psi(\delta) = H(\delta)$, as $H$ is convex. For the hinge loss, $\varphi(\alpha) = [1 - \alpha]_+$, we obtain for any $f$ that
            $P(Yf(X) \le 0) - \inf_f P(Yf(X) \le 0) \le \mathbb{E}[[1 - Yf(X)]_+] - \inf_f \mathbb{E}[[1 - Yf(X)]_+]$.
      On the other hand, for the exponential loss, using that $\psi(\delta) = 1 - \sqrt{1 - \delta^2} \ge \delta^2 / 2$, we have
            $\frac{1}{2}\Big(P(Yf(X) \le 0) - \inf_f P(Yf(X) \le 0)\Big)^2 \le \mathbb{E}[\exp(-Yf(X))] - \inf_f \mathbb{E}[\exp(-Yf(X))]$.
      The hinge-loss bound is sharper.
      2. Example 11.8 (Regression for classification): What about the surrogate loss $\frac{1}{2}(f(x) - y)^2$? In the homework, you will show which margin-based loss this corresponds to, and moreover that $H(\delta) = \frac{1}{2}\delta^2$. So regressing on the labels is consistent.
   (c) Proof of Theorem 11.7
      The proof of the theorem proceeds in several parts.
      1. We first state a lemma, which follows from the results on convex functions we have already proved. The lemma is useful for several different parts of our proof.

      Lemma 11.9. We have the following.
      a. The functions $H$ and $\psi$ are continuous.
      b. We have $H \ge 0$ and $H(0) = 0$.
      c. If $H(\delta) > 0$ for all $\delta > 0$, then $\psi(\delta) > 0$ for all $\delta > 0$.

      To see part b, note that at $\eta = 1/2$ the constraint $\alpha(2\eta - 1) \le 0$ is vacuous, so
            $\ell_\varphi^{\mathrm{wrong}}(1/2) := \inf_{\alpha(2 \cdot \frac{1}{2} - 1) \le 0} \ell_\varphi(\alpha, 1/2) = \inf_\alpha \ell_\varphi(\alpha, 1/2) = \ell_\varphi^*(1/2)$,
      whence $H(0) = \ell_\varphi^*(1/2) - \ell_\varphi^*(1/2) = 0$. (It is clear that the sub-optimality gap satisfies $H \ge 0$ by construction.)
      2. We begin with the first statement of the theorem, inequality (11.0.2). Consider first the gap (for a fixed margin $\alpha$) in conditional 0-1 risk,
            $\ell(\alpha, \eta) - \inf_{\alpha'} \ell(\alpha', \eta) = \eta 1\{\alpha \le 0\} + (1 - \eta) 1\{\alpha \ge 0\} - \eta 1\{\eta \le 1/2\} - (1 - \eta) 1\{\eta \ge 1/2\}$,
      which equals $0$ if $\operatorname{sign}(\alpha) = \operatorname{sign}(\eta - \frac{1}{2})$, and equals $\max\{\eta, 1 - \eta\} - \min\{\eta, 1 - \eta\} = |2\eta - 1|$ if $\operatorname{sign}(\alpha) \ne \operatorname{sign}(\eta - \frac{1}{2})$. In particular, we obtain that the gap in risks is
            $R(f) - R^* = \mathbb{E}[1\{\operatorname{sign}(f(X)) \ne \operatorname{sign}(2\eta(X) - 1)\} |2\eta(X) - 1|]$.   (11.0.3)
      Now we use expression (11.0.3) to get an upper bound on $R(f) - R^*$ via the $\varphi$-risk. Indeed, consider the $\psi$-transform (11.0.1). By Jensen's inequality (as $\psi$ is convex), we have that
            $\psi(R(f) - R^*) \le \mathbb{E}[\psi(1\{\operatorname{sign}(f(X)) \ne \operatorname{sign}(2\eta(X) - 1)\} |2\eta(X) - 1|)]$.
      Now we recall from Lemma 11.9 that $\psi(0) = 0$. Thus we have
            $\psi(R(f) - R^*) \le \mathbb{E}[1\{\operatorname{sign}(f(X)) \ne \operatorname{sign}(2\eta(X) - 1)\} \psi(|2\eta(X) - 1|)]$.   (11.0.4)
      Now we use the special structure of the sub-optimality function we have constructed. Note that $\psi \le H$, and moreover, because $(1 + |2\eta - 1|)/2 = \max\{\eta, 1 - \eta\}$, we have for any $\alpha \in \mathbb{R}$ that
            $1\{\operatorname{sign}(\alpha) \ne \operatorname{sign}(2\eta - 1)\} H(|2\eta - 1|) = 1\{\operatorname{sign}(\alpha) \ne \operatorname{sign}(2\eta - 1)\} \Big[\inf_{\alpha'(2\eta - 1) \le 0} \ell_\varphi(\alpha', \eta) - \ell_\varphi^*(\eta)\Big] \le \ell_\varphi(\alpha, \eta) - \ell_\varphi^*(\eta)$.   (11.0.5)
      Combining inequalities (11.0.4) and (11.0.5), we see that
            $\psi(R(f) - R^*) \le \mathbb{E}[1\{\operatorname{sign}(f(X)) \ne \operatorname{sign}(2\eta(X) - 1)\} H(|2\eta(X) - 1|)] \le \mathbb{E}[\ell_\varphi(f(X), \eta(X)) - \ell_\varphi^*(\eta(X))] = R_\varphi(f) - R_\varphi^*$,
      which is our desired result.
      3. Having proved the quantitative bound (11.0.2), we now turn to proving the second part of Theorem 11.7. Using Lemma 11.9, we can prove the equivalence of all three items. We begin by showing that IV(b)1 implies IV(b)2. If $\varphi$ is classification calibrated, we have $H(\delta) > 0$ for all $\delta > 0$. Because $\psi$ is continuous and $\psi(0) = 0$, if $\delta \to 0$, then $\psi(\delta) \to 0$. It remains to show that $\psi(\delta_n) \to 0$ implies $\delta_n \to 0$. But this is clear: we know that $\psi(0) = 0$ and $\psi(\delta) > 0$ whenever $\delta > 0$, and the convexity of $\psi$ then implies that $\psi$ is increasing, so $\psi(\delta_n) \to 0$ forces $\delta_n \to 0$.
      To obtain IV(b)3 from IV(b)2, note that by inequality (11.0.2), we have
            $\psi(R(f_n) - R^*) \le R_\varphi(f_n) - R_\varphi^* \to 0$,
      so we must have that $\delta_n = R(f_n) - R^* \to 0$.
      Finally, we show that IV(b)1 follows from IV(b)3. Assume for the sake of contradiction that IV(b)3 holds but IV(b)1 fails, that is, $\varphi$ is not classification calibrated. Then there must exist some $\eta < 1/2$ and a sequence $\alpha_n \ge 0$ (i.e. a sequence of predictions with incorrect sign) satisfying $\ell_\varphi(\alpha_n, \eta) \to \ell_\varphi^*(\eta)$. Construct the classification problem with a singleton $\mathcal{X} = \{x\}$, and set $P(Y = 1) = \eta$. Then the sequence $f_n(x) = \alpha_n$ satisfies $R_\varphi(f_n) \to R_\varphi^*$, but the true 0-1 risk $R(f_n) \not\to R^*$.

V. Classification calibration in the convex case
   (a) Suppose that $\varphi$ is convex, which we often use for computational reasons
   (b) Theorem (Bartlett, Jordan, McAuliffe [1]). If $\varphi$ is convex, then $\varphi$ is classification calibrated if and only if $\varphi'(0)$ exists and $\varphi'(0) < 0$.

      Proof   First, suppose that $\varphi$ is differentiable at $0$ and $\varphi'(0) < 0$. Then $\ell_\varphi(\alpha, \eta) = \eta \varphi(\alpha) + (1 - \eta)\varphi(-\alpha)$ satisfies $\frac{\partial}{\partial \alpha} \ell_\varphi(0, \eta) = (2\eta - 1)\varphi'(0)$, and if $\varphi'(0) < 0$, this quantity is negative for $\eta > 1/2$. Thus the minimizing $\alpha(\eta) \in (0, \infty]$. (Proof by picture, but formalized in the full notes.)
      For the other direction, assume that $\varphi$ is classification calibrated. Recall that a subgradient $g_\alpha$ of the function $\varphi$ at $\alpha \in \mathbb{R}$ is any $g_\alpha$ such that $\varphi(t) \ge \varphi(\alpha) + g_\alpha (t - \alpha)$ for all $t \in \mathbb{R}$. (Picture.) Let $g_1, g_2$ be subgradients of $\varphi$ at $0$, so that $\varphi(\alpha) \ge \varphi(0) + g_1 \alpha$ and $\varphi(\alpha) \ge \varphi(0) + g_2 \alpha$ for all $\alpha$; these exist by convexity. We show that both $g_1, g_2 < 0$ and $g_1 = g_2$. By convexity we have
            $\ell_\varphi(\alpha, \eta) \ge \eta(\varphi(0) + g_1 \alpha) + (1 - \eta)(\varphi(0) - g_2 \alpha) = [\eta g_1 - (1 - \eta) g_2]\alpha + \varphi(0)$.   (11.0.6)
      We first show that $g_1 = g_2$, meaning that $\varphi$ is differentiable at $0$. Without loss of generality, assume $g_1 > g_2$.
Then at $\eta = 1/2$ the bracketed term in (11.0.6) is $(g_1 - g_2)/2 > 0$, so for $\eta > 1/2$ sufficiently close to $1/2$ we have $\eta g_1 - (1 - \eta) g_2 > 0$. For such $\eta$, inequality (11.0.6) would imply that for all $\alpha \ge 0$,
      $\ell_\varphi(\alpha, \eta) \ge \varphi(0) \ge \inf_{\alpha' \le 0} \{\eta \varphi(\alpha') + (1 - \eta) \varphi(-\alpha')\} = \ell_\varphi^{\mathrm{wrong}}(\eta)$,
where the second inequality follows by taking $\alpha' = 0$. By our assumption of classification calibration, for $\eta > 1/2$ we know that
      $\inf_\alpha \ell_\varphi(\alpha, \eta) < \inf_{\alpha \le 0} \ell_\varphi(\alpha, \eta) = \ell_\varphi^{\mathrm{wrong}}(\eta)$, so $\ell_\varphi^*(\eta) = \inf_{\alpha \ge 0} \ell_\varphi(\alpha, \eta)$,
and under the assumption that $g_1 > g_2$ we obtain $\ell_\varphi^*(\eta) = \inf_{\alpha \ge 0} \ell_\varphi(\alpha, \eta) \ge \ell_\varphi^{\mathrm{wrong}}(\eta)$, which is a contradiction to classification calibration. We thus obtain $g_1 = g_2$, so that the function $\varphi$ has a unique subderivative at $\alpha = 0$ and is thus differentiable there.
      Now that we know $\varphi$ is differentiable at $0$, consider
            $\eta \varphi(\alpha) + (1 - \eta) \varphi(-\alpha) \ge (2\eta - 1) \varphi'(0) \alpha + \varphi(0)$.
      If $\varphi'(0) \ge 0$, then for $\alpha \ge 0$ and $\eta > 1/2$ the right-hand side is at least $\varphi(0)$, which contradicts classification calibration, because we know that $\ell_\varphi^*(\eta) < \ell_\varphi^{\mathrm{wrong}}(\eta)$, exactly as in the preceding argument.

Proofs of convex analytic results

Proof of Lemma 11.3   First, let $(a, b) \subset \operatorname{dom} f$ and fix $x_0 \in (a, b)$. Consider $x \uparrow x_0$, which is no loss of generality, and we may also assume $x \in (a, b)$. Then we have
      $x = \alpha a + (1 - \alpha) x_0$ and $x_0 = \beta b + (1 - \beta) x$
for some $\alpha, \beta \in [0, 1]$. By convexity,
      $f(x) \le \alpha f(a) + (1 - \alpha) f(x_0) = f(x_0) + \alpha (f(a) - f(x_0))$
and
      $f(x_0) \le \beta f(b) + (1 - \beta) f(x)$, or $f(x) \ge \frac{1}{1 - \beta} f(x_0) - \frac{\beta}{1 - \beta} f(b)$.
Taking $\alpha, \beta \to 0$ as $x \to x_0$, we obtain
      $\liminf_{x \to x_0} f(x) \ge f(x_0)$ and $\limsup_{x \to x_0} f(x) \le f(x_0)$,
as desired.

Proof of Lemma 11.4   We need only consider the endpoints of the domain, by Lemma 11.3, and since closedness gives lower semicontinuity, we only need to show that $\limsup_{x \to x_0} f(x) \le f(x_0)$. But this is obvious by convexity: let $x = t y + (1 - t) x_0$ for any $y \in \operatorname{dom} f$; taking $t \to 0$, we have $f(x) \le t f(y) + (1 - t) f(x_0) \to f(x_0)$.

Proof of Lemma 11.6   We begin with case (i). Define the function $h_{\mathrm{low}}(t) := \inf_{s \ge t} h(s)$. Then because $h$ is continuous, we know that over any compact set it attains its infimum, and thus (by assumption on $h$) $h_{\mathrm{low}}(t) > 0$ for all $t > 0$. Moreover, $h_{\mathrm{low}}$ is non-decreasing. Now define $f_{\mathrm{low}} = h_{\mathrm{low}}^{**}$ to be the biconjugate of $h_{\mathrm{low}}$; it is clear that $f \ge f_{\mathrm{low}}$ as $h \ge h_{\mathrm{low}}$. Thus case (ii) applied to $h_{\mathrm{low}}$ implies case (i), so we turn to the more general case (ii) to see that $f_{\mathrm{low}}(t) > 0$ for all $t > 0$.
For the result in case (ii), assume for the sake of contradiction that there is some $z \in (0, 1)$ satisfying $h^{**}(z) = 0$. It is clear that $h^{**}(0) = 0$ and $h^{**} \ge 0$, so by convexity we must have $h^{**}(z/2) = 0$. Now, by assumption we have $h(z/2) = b > 0$, whence (as $h$ is non-decreasing) $h(t) \ge b > 0$ for all $t \ge z/2$. In particular, the piecewise linear function defined by
      $g(t) = 0$ if $t \le z/2$, and $g(t) = \frac{b}{1 - z/2}(t - z/2)$ if $t > z/2$,
is closed, convex, and satisfies $g \le h$. But $g(z) > 0 = h^{**}(z)$, a contradiction to the fact that $h^{**}$ is the largest (closed) convex function below $h$.
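The derivative criterion from the theorem in Section V is easy to apply numerically. A small sketch checking several standard convex margin losses (our own code; each loss below happens to be differentiable at $0$, so a symmetric difference quotient suffices):

```python
import math

losses = {
    "exponential": lambda a: math.exp(-a),          # e^{-a}
    "hinge":       lambda a: max(1.0 - a, 0.0),     # [1 - a]_+
    "logistic":    lambda a: math.log1p(math.exp(-a)),
    "squared":     lambda a: (1.0 - a) ** 2,        # margin form of (f - y)^2
}

def deriv_at_zero(phi, h=1e-6):
    # symmetric difference quotient; valid here since each phi above is
    # differentiable at 0 (the hinge is non-differentiable only at a = 1)
    return (phi(h) - phi(-h)) / (2 * h)

# each loss has phi'(0) < 0, hence is classification calibrated
assert all(deriv_at_zero(phi) < 0 for phi in losses.values())
```

The computed slopes at zero are approximately $-1$ (exponential and hinge), $-1/2$ (logistic), and $-2$ (squared), all negative, consistent with these convex losses being classification calibrated.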
Bibliography

[1] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
More informationCourse 212: Academic Year Section 1: Metric Spaces
Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........
More informationNotes on uniform convergence
Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean
More informationFUNCTIONAL COMPRESSION-EXPANSION FIXED POINT THEOREM
Electronic Journal of Differential Equations, Vol. 28(28), No. 22, pp. 1 12. ISSN: 172-6691. URL: http://ejde.math.txstate.edu or http://ejde.math.unt.edu ftp ejde.math.txstate.edu (login: ftp) FUNCTIONAL
More informationAdvanced Calculus I Chapter 2 & 3 Homework Solutions October 30, Prove that f has a limit at 2 and x + 2 find it. f(x) = 2x2 + 3x 2 x + 2
Advanced Calculus I Chapter 2 & 3 Homework Solutions October 30, 2009 2. Define f : ( 2, 0) R by f(x) = 2x2 + 3x 2. Prove that f has a limit at 2 and x + 2 find it. Note that when x 2 we have f(x) = 2x2
More informationOn surrogate loss functions and f-divergences
On surrogate loss functions and f-divergences XuanLong Nguyen, Martin J. Wainwright, xuanlong.nguyen@stat.duke.edu wainwrig@stat.berkeley.edu Michael I. Jordan, jordan@stat.berkeley.edu Department of Statistical
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationAnalysis Finite and Infinite Sets The Real Numbers The Cantor Set
Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered
More informationThe proximal mapping
The proximal mapping http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/37 1 closed function 2 Conjugate function
More informationFenchel Duality between Strong Convexity and Lipschitz Continuous Gradient
Fenchel Duality between Strong Convexity and Lipschitz Continuous Gradient Xingyu Zhou The Ohio State University zhou.2055@osu.edu December 5, 2017 Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 1
More informationIntegral Jensen inequality
Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a
More informationStanford Statistics 311/Electrical Engineering 377
I. Bayes risk in classification problems a. Recall definition (1.2.3) of f-divergence between two distributions P and Q as ( ) p(x) D f (P Q) : q(x)f dx, q(x) where f : R + R is a convex function satisfying
More informationTHE UNIQUE MINIMAL DUAL REPRESENTATION OF A CONVEX FUNCTION
THE UNIQUE MINIMAL DUAL REPRESENTATION OF A CONVEX FUNCTION HALUK ERGIN AND TODD SARVER Abstract. Suppose (i) X is a separable Banach space, (ii) C is a convex subset of X that is a Baire space (when endowed
More informationIntroduction to Convex Analysis Microeconomics II - Tutoring Class
Introduction to Convex Analysis Microeconomics II - Tutoring Class Professor: V. Filipe Martins-da-Rocha TA: Cinthia Konichi April 2010 1 Basic Concepts and Results This is a first glance on basic convex
More informationExtended Monotropic Programming and Duality 1
March 2006 (Revised February 2010) Report LIDS - 2692 Extended Monotropic Programming and Duality 1 by Dimitri P. Bertsekas 2 Abstract We consider the problem minimize f i (x i ) subject to x S, where
More informationConvex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version
Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More informationLECTURE SLIDES ON BASED ON CLASS LECTURES AT THE CAMBRIDGE, MASS FALL 2007 BY DIMITRI P. BERTSEKAS.
LECTURE SLIDES ON CONVEX ANALYSIS AND OPTIMIZATION BASED ON 6.253 CLASS LECTURES AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY CAMBRIDGE, MASS FALL 2007 BY DIMITRI P. BERTSEKAS http://web.mit.edu/dimitrib/www/home.html
More informationLecture 1: January 12
10-725/36-725: Convex Optimization Fall 2015 Lecturer: Ryan Tibshirani Lecture 1: January 12 Scribes: Seo-Jin Bang, Prabhat KC, Josue Orellana 1.1 Review We begin by going through some examples and key
More informationA function(al) f is convex if dom f is a convex set, and. f(θx + (1 θ)y) < θf(x) + (1 θ)f(y) f(x) = x 3
Convex functions The domain dom f of a functional f : R N R is the subset of R N where f is well-defined. A function(al) f is convex if dom f is a convex set, and f(θx + (1 θ)y) θf(x) + (1 θ)f(y) for all
More informationExtreme Abridgment of Boyd and Vandenberghe s Convex Optimization
Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The
More informationConvex Analysis and Optimization Chapter 4 Solutions
Convex Analysis and Optimization Chapter 4 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More informationCharacterizations of the solution set for non-essentially quasiconvex programming
Optimization Letters manuscript No. (will be inserted by the editor) Characterizations of the solution set for non-essentially quasiconvex programming Satoshi Suzuki Daishi Kuroiwa Received: date / Accepted:
More informationEC9A0: Pre-sessional Advanced Mathematics Course. Lecture Notes: Unconstrained Optimisation By Pablo F. Beker 1
EC9A0: Pre-sessional Advanced Mathematics Course Lecture Notes: Unconstrained Optimisation By Pablo F. Beker 1 1 Infimum and Supremum Definition 1. Fix a set Y R. A number α R is an upper bound of Y if
More informationConvex Optimization Theory
Convex Optimization Theory A SUMMARY BY DIMITRI P. BERTSEKAS We provide a summary of theoretical concepts and results relating to convex analysis, convex optimization, and duality theory. In particular,
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More information