Classification with Reject Option

Size: px

Start display at page:

Download "Classification with Reject Option"

Jennifer Robertson
6 years ago
Views:

1 Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012

2 Outline. Introduction.. Classification with reject option. Spirit of the papers BW Infinite sample consistency.. Bounding excess risk.. Convergence of excess risk. Results.. Generalized hinge loss (SVM) BW Further generalization WY2010

3 Introduction reject option classification: observed data: {X, Y } X {±1} discriminant function: f : X R, typically classify by sign(f ) reject option: withhold for cost d when f < δ, and proceed as usual if f > δ for specified d, δ motivation: want to avoid making hard decisions

4 Introduction loss loss as a function of Yf (X ) typically: l(z) = { 1 if z 0 0 if z > 0 reject option: 1 if z < δ l d,δ (z) = 0 if z δ d if z δ for some d [0, 1/2], δ (0, 1)

5 Introduction loss loss as a function of Yf (X ) reject option: 1 if z < δ l d,δ (z) = 0 if z δ d if z δ for some d [0, 1/2], δ (0, 1) if d 0 always reject, if d > 1/2 never reject intuition: d is necessary amount of confidence for classifying

6 Introduction Bayes rule minimize L d,δ (f ) = E{l d,δ (Yf (X ))} f d (X ) = where η(x ) = P(Y = +1 X ) +1 if η(x ) > 1 d 0 if d η(x ) 1 d 1 if η(x ) < d follows from conditional expectation: E Y X {l d,δ (Yf (X ))} = η(x ) l d,δ (f (X )) + (1 η(x )) l d,δ ( f (X ))

7 Introduction Bayes rule minimize L d,δ (f ) = E{l d,δ (Yf (X ))} f d (X ) = where η(x ) = P(Y = +1 X ) +1 if η(x ) > 1 d 0 if d η(x ) 1 d 1 if η(x ) < d in practice, minimize empirical risk: P n (f ) = 1 n n l d,δ (Y i f (X i )) i=1 minimizing P n is NP hard

8 General spirit propose convex surrogate loss φ d consider φ d risk: Q(f ) = E{φ d (Yf (X )} minimize empirical risk: Q n (f ) let ˆf n = argmin f F Q n study the performance of φ d.. consistency.. bounding excess risk.. convergence rates for empirical maximizer

9 General spirit. consistency.. f d = argmin f Q(f ).. asymptotic minimizer of φ d risk is same as for l d,δ risk.. able to recover f d. bounding excess risk.. want some non-decreasing function ρ( ) such that: L d,δ (f ) L d,δ (f d ) ρ(q(f ) Q(f d )) L d,δ (f ) ρ( Q(f )).. guarantee performance of f. convergence rate for empirical maximizer.. compute rate for Q(ˆf n ) 0.. combine with bound on excess risk

10 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 consistency Proposition 1 The Bayes discriminant function fd Q with: dq(f d ) = L d,δ(f d ) minimizes risk

11 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 excess risk bounds note: for δ 1 d, l d,φ (X ) φ d (X ) Theorem 2 for fixed f, d [0, 1/2): for all δ (0, 1/2], for all δ [1/2, 1 d], L d,δ (f ) d δ Q(f ) L d,δ(f ) Q(f ) recommend using δ = 1/2

12 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 excess risk bounds note: for δ 1 d, l d,φ (X ) φ d (X ) Theorem 2 for fixed f, (δ, d) = (0, 1/2): L(f ) Q(f ) risk bounds for usual classification with hinge loss

13 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 convergence rates bounds of the form: EQ(ˆf n ) Q(fd ) 2 inf ( Q(f ) Q(f d ) ) + ɛ n f F. approximation error for space F.. method of sieves. estimation error for finite sample n authors results

14 Generalized hinge loss convergence rates Margin condition: we say that η satisfies the margin condition at d with exponent α > 0 if there is a c 1 such that for all t > 0, P{ η(x ) d t} ct α P{ η(x ) (1 d) t} ct α Bernstein class: we say that G L 2 (P) is a (β, B)-Bernstein class with respect to the probability measure P, with β (0, 1] and B 1 if every g G satisfies Pg 2 B(Pg) β

15 Generalized hinge loss convergence rates Lemma 8 if η satisfies the margin condition at d with exponent α, then for any class F of measurable uniformly bounded functions, the class G = {g f : f F} has a Bernstein exponent β = where g f (x, y) = φ d (yf (x)) φ d (yf d (x)). the excess φ d loss. E(g f (X, Y )) = Q(f ) α 1+α

16 Generalized hinge loss convergence rates Theorem 12 if η satisfies the margin condition at d with exponent α, F is a countable class of functions f : X R satisfying f B, and F satisfies log N(ɛ, L, F) Cɛ p for all ɛ > 0 and some 0 p 2, then there exists a constant C independent of n, such that EQ(ˆf n ) Q(fd ) 2 inf ( Q(f ) Q(f d ) ) + C n 1+α 2+p+α+pα f F

17 Generalized hinge loss - QCQP can be solved as the following quadratic constraint quadratic programming problem (QCQP) Theorem 5 for any x 1,..., x n X and y 1,..., y n {±1}, let ˆf H be the solution to where r > 0 minimize f n φ d (y i f (x i )) i=1 such that f 2 r 2

18 Generalized hinge loss - QCQP can be solved as the following quadratic constraint quadratic programming problem (QCQP) Theorem 5 Then we can represent ˆf as the finite sum ˆf (x) = n i=1 ˆα ik(x i, x), where ˆα 1,..., ˆα n is the solution to the following QCQP: 1 n min α i,ξ i γ i n i=1 such that i,j ( ξ i + 1 2d d γ i ) α i α j K(x i, x j ) r 2 n ξ i 0, ξ i 1 y i α j K(x i, x j ) j=1 n γ i 0, γ i y i α j K(x i, x j ) j=1

19 Generalization generalize to the following loss 1 if g(x ) Y and g(x ) 0 l[g(x ), Y ] = d if g(x ) = 0 0 if g(x ) = Y with classification rule +1 if f (X ) > δ C(f (X ), δ) = 0 if f (X ) δ 1 if f (X ) < δ equivalence to previous case can be seen by: l[c(f (X ), δ), Y ] l d,δ (Yf (X )) since looking at a family of surrogates we want to separate δ from the loss function

20 Generalization consistency Theorem 1 assue for convex φ, the classification rule C(fφ, δ) for some δ > 0 is infinite sample consistent if and only if. φ (δ) and φ ( δ) both exist. φ (0) < 0. φ (δ) φ (δ)+φ ( δ) = d for strictly convex φ, at most one value of δ for generalized hinge loss, holds for δ (0, 1)

21 Generalization excess risk bounds Theorem 9 assume φ is a convex infinite sample consistent surrogate and suppose that there exists constants C > 0 and s 1 such that then, η d s C s Q η ( δ) (1 η) d s C s Q η (δ) R[C(f, δ)] 2C[ Q(f )] 1/s where Q η (z) = η(x )φ(z) + (1 η(x ))φ( z)

22 Generalization excess risk bounds Theorem 10 further assume 2 s (η 1/2) s + C s Q η ( δ) 2 s ((1 η) 1/2) s + C s Q η (δ) then, R[C(f, δ)] C[ Q(f )] 1/s Theorem 11 if margin condition holds for some c 1 and α 0, then, R[C(f, δ)] K[ Q(f )] 1/(s+β βs) where K depends on c and α, and β = α/(1 + α)

23 Generalization convergence rates Theorem 18 assume that f B for all f F and let γ (0, 1). with probability at least 1 γ Q(ˆf n ) inf Q(f ) + 3L ( L 2 f F n + 8 2c + B ) log(nn /γ) 6 n where N n = N(1/n, L, F) relies on:. Lipschitz continuity of φ. constraint on modulus of convexity of Q

24 Generalization convergence rates Corollary 19 under the assumptions of Theorems 9 and 18, we have, with probability at least 1 γ { R(C(ˆf n, δ)) 2C inf Q(f ) + 3L ( L 2 f F n + 8 2c + LB 3 ) log(nn /γ) n } 1/s if the margin condition also holds, with probability at least 1 γ { R(C(ˆf n, δ)) K inf Q(f ) + 3L ( L 2 f F n + 8 2c + LB 3 ) log(nn /γ) n } 1/(sβ βs)

25 Miscellanea other losses considered: least squares exponential logistic squared hinge DWD further extensions: asymmetric loss hinge loss with L 1 regularization YW Bernoulli, 2011

Lecture 10 February 23

EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only