Selective Prediction

Size: px

Start display at page:

Download "Selective Prediction"

Blaise Hunt
5 years ago
Views:

1 COMS Fall 2017 November 8, 2017 Selective Predictio Preseter: Rog Zhou Scribe: Wexi Che 1 Itroductio I our previous discussio o a variatio o the Valiat Model [3], the described learer has the ability to output I do t kow i additio to the regular biary outputs This is the behavior of a selective classifier I other words, a selective classifier is allowed to reject decisio makig without pealty Such classifiers are useful i makig medical predictios, because the cost of makig a wrog predictio is usually much higher tha refusig to make ay decisio i these situatios 11 Ideal Selective Classifier What would a ideal selective classifier looks like? Ituitively, misclassificatio rate should ot be the oly measuremet for selective classifiers Imagie a selective classifier that refuses makig ay predictio, such classifier is useless, but it has a misclassificatio rate of 0 Therefore, it makes sese to evaluate a selective classifier ot oly by its misclassificatio rate, but also the probability it will refuse makig predictio Let C be a selective classifier i biary classificatio settig, we defie the followig: Coverage (cover(c)): the probability that C predicts a label istead of refusig makig predictio Erorr (err(c)): the probability that the true label is differet from C s predictio whe C makes predictio Risk: risk(c) = err(c) cover(c) We seek to boud both error ad coverage of a ideal classifier with high probability (1 δ) where 0 δ 1 2 Realizable Settig Let X deote the feature space, D over X { 1, 1} be the uderlyig ukow data distributio Set S = {{x 1, y 1 }, {x 2, y 2 },, {x, y }} is a set of labelled examples, ad U = {x +1, x +2,, x +m } is a set of m ulabelled examples, where x i X ad y j { 1, 1} for 1 i + m, 1 j Let H be a set of hypotheses, ad h H be the target hypothesis such that the true label of x is the same as the predictio h (x) I additio, we 1

2 defie versio space V with respect to S to be the set of hypotheses that are cosistet with the examples i S We itroduce 3 selective classifiers: Cofidece-rated Predictor, CZ Selective Classifier ad Selective Classifier Strategy 21 Cofidece-rated Predictor A cofidece-rated predictor C maps from U to a set of m distributios over { 1, 0, 1} If the j-th distributio is [β j, 1 β j α j, α j ], the Pr{C(x +j ) = 1} = β j, Pr{C(x +j ) = 1} = α j ad Pr{C(x +j ) = 0} = 1 β j α j Algorithm 1 Cofidece-rated Predictor iput Labelled data S, ulabelled data U, error boud ɛ 1: Compute versio space V with respect to S 2: Solve the liear program: m max (α i + β i ) subject to: h V, i, α i + β i 1 i, α i, β i 0 β i + i:h(x +i )=1 i:h(x +i )= 1 α i ɛm 3: retur the cofidece-rated predictor {[β i, 1 β i α i, α i ], i = 1, 2,, m} Theorem 1 A cofidece-rated predictor produced by Algorithm 1 has a error guaratee ɛ with optimal coverage for the ulabelled examples i U Proof Accordig the the costraits i the liear program, 0 α i 1, 0 β i 1, 0 1 α i β i 1 ad the probability the predictor makes a wrog decisio with respect to the uiform distributio over U is less tha ɛm/m = ɛ Moreover, the liear program maximizes m α i + β i, which is equivalet as miimizig that is the coverage of the predictor 1 m m 1 α i β i Remark 1 The feasible regio of the liear program i Algorithm 1 is always o-empty Proof The cadidate solutio {[0, 1, 0], i = 1, 2,, 3} will always be i the feasible solutio set 2

3 22 CZ Selective Classifier A CZ selective classifier C is defied by a tuple (h, (γ 1, γ 2,, γ m )) where h H ad 0 γ i 1 for all i = 1, 2,, m Give ulabelled example x +i U, C predicts 0 with probability 1 γ i ad predicts C(x +i ) = h(x +i ) with probability γ i Algorithm 2 CZ Selective Classifier iput Labelled data S, ulabelled data U, error boud ɛ 1: Compute versio space V with respect to S 2: Radomly choose h 0 V 3: Solve the liear program: m max subject to: γ i i, 0 γ i 1 h V, γ i ɛm i:h(x +i ) h 0 (x +i ) 4: retur the CZ selective classifier (h 0, (γ 1, γ 2,, γ m )) Theorem 2 Let P be a cofidece-rated predictor produced by Algorithm 1 A CZ selective classifier C produced by Algorithm 2 has a error guaratees ɛ with cover(c) cover(p) ɛ Ituitively, there exists h 1 V that produces the most differet predictios o U from h 0 If h 1 ad h 0 lies i two opposite eds of the versio space, cover(c) would be worse tha the average case 23 Selective Classifier Strategy There are a few drawbacks usig cofidece-rated predictors ad CZ selective classifiers First, the umber of costraits ca be ifiite Secod, we eed to iput ulabelled dataset U ad these two methods oly produce predictio of examples i U However, we wat to geeralize the problem ad provide a solutio uder iductive settig Thus, we are goig to remove U from the iput to the algorithm I order to use selective classifier strategy, we itroduce a few additioal defiitios: Let d be the VC dimesio of H True error rate of hypothesis h is: err P (h) = Pr {h(x) Y } (X,Y ) D 3

4 empirical error rate of hypothesis h is: err S (h) = 1 1 {h(xi ) y i } For ay hypothesis class H, distributio D over X, ad real umber r > 0, ad V(h, r) = {h H : err P (h ) err P (h) + r} ˆV(h, r) = {h H : err S (h ) err S (h) + r} B(h, r) = {h H : Pr X D {h (X) h(x)} r} Disagreemet regio of hypothesis class H is: DIS(H) = {x X : h 1, h 2 H st h 1 (x) h 2 (x)} for G H, let G the volume of the disagreemet regio: G = Pr{DIS(G)} disagreemet coefficiet is: θ = sup r>0 B(h, r) r A ew type of selective classifier C st C(x) = (h, g)(x) = { h(x) if g(x) = 0 0 if g(x) = 0 ad cover(h, g) = E[g(X)] Algorithm 3 Selective Classifier Strategy iput labelled data S, d, δ Output a selective classifier (h, g) st risk(h, g) = risk(h, g) 1: Compute versio space V with respect to S 2: Radomly choose h 0 V 3: Set G = V 4: Costruct g st g(x) = 1 if ad oly if x {X \ DIS(G)} 5: Set h = h 0 4

5 Theorem 3 (Cosistet Hypothesis error rate boud i terms of VC dimesio, theorem 215 from lecture otes) For ay ad δ (0, 1), with probability at least 1 δ, every hypothesis h V has error rate err P (h) 4d l(2 + 1) + 4 l 4 δ Theorem 4 Let C = (h, g) be a selective classifier output by Algorithm 3, the h achieves 0 error rate whe it makes predictio, ad C achieves ear optimal coverage with probability 1 δ Proof Give g(x) = 1 if ad oly if x is ot i the disagreemet regio of the versio space V, whe h makes a predictio, it will always agree with all hypotheses i V icludig h Thus, h achieves 0 error rate, moreover risk(h, g) = risk(h, g) Now, we will show that C achieves ear optimal coverage with probability 1 δ Let r = 4d l(2+1)+4 l 4 δ, the for ay h V, h V(h, r) ad V V(h, r) Furthermore, if h V(h, r), h B(h, r) i the realizable settig Also, we kow that r (0, 1), B(h, r) θr Therefore, with probability at least 1 δ, V B(h, r) θr ad cover(h, g) = 1 V 1 θr = 1 θ 4d l(2 + 1) + 4 l 4 δ 3 The Noisy Settig I the oisy settig, the labels are correspodig to the predictio of target hypothesis h with oises We will show that with high probability, the selective classifier produced by Algorithm 4 achieves same error rate as h whe it makes predictio, ad its coverage is ear optimal Algorithm 4 Selective Classifier Strategy - Noisy iput labelled data S, d, δ Output a selective classifier (h, g) st risk(h, g) = risk(h, g) with probability 1 δ 1: Set ĥ = ERM(H, S) so that ĥ is ay empirical risk miimizer from H 2: Set G = ˆV(ĥ, 4 2e d l( d 2 )+l 8 δ ) 3: Costruct g st g(x) = 1 if ad oly if x {X \ DIS(G)} 4: Set h = ĥ Before we prove the error boud ad coverage of Algorithm 4, let s itroduce some ew defiitios: 5

6 We would like to geeralize the defiitio of risk usig a loss fuctio L(Y, Y), the risk(h, g) = E[L(h(X), Y )g(x)] cover(h, g) Excess loss class is defied as F = {L(h(x), y) L(h (x), y) : h H} Give 0 β 1 ad B 1, class F is said to be a (β, B)-Berstei class with respect to D, if every f F satisfies E[f 2 ] B(E[f]) β Lemma 1 If F is a (β, B)-Berstei class with respect to D, the for ay r > 0, Proof If h V(h, r), by defiitio of V Also V(h, r) B(h, Br β ) E[1 {h(x) Y } ] E[1 {h (X) Y }] + r E[1 {h(x) Y } 1 {h (X) Y }] r E[1 {h(x) h (X)}] = E[ 1 {h(x) Y } 1 {h (X) Y } ] = E[(1 {h(x) Y } 1 {h (X) Y }) 2 ] (1) = E[(L(h(X), Y ) L(h (X), Y )) 2 ] (2) = E[f 2 ] B(E[f]) β = B(E[1 {h(x) Y } 1 {h (X) Y }]) β Br β From (1) to (2) we assume usig 0-1 loss fuctio Ituitively, we ca imagie that B(h, Br β ) as havig a bigger tolerace such that the ball icludes V(h, r) as the 2D visualizatio Figure 1 show below Theorem 5 (Lemma 1 from [1]) For ay δ > 0, if H has VC dimesio d, with probability at least 1 δ 2e d l( d h H, err P (h) err S (h) ) + l 2 δ Similarly, uder the same coditio, h H, err S (h) err P (h) e d l( ) + l 2 d δ

Figure 1: 2D visualizatio of ituitio behid Lemma 1 Left: The yellow regio depicts the hypothesis class H The yellow regio that lies iside the gree dotted circle represets the hypotheses i V(h, r)

differece i defiitio, the red dotted circle ad gree dotted circle have differet ceters Lemma 2 For ay r > 0, 0 < δ < 1, with probability at least 1 δ, ˆV(ĥ, r) V(h, 2σ(, δ/2, d) + r) where σ(, δ, d)

7 Figure 1: 2D visualizatio of ituitio behid Lemma 1 Left: The yellow regio depicts the hypothesis class H The yellow regio that lies iside the gree dotted circle represets the hypotheses i V(h, r) Right: The yellow regio iside the red dotted circle represets the hypotheses i B(h, Br β ) By Lemma 1 we show that the red dotted circle fully cotais the gree dotted circle Also, because the differece i defiitio, the red dotted circle ad gree dotted circle have differet ceters Lemma 2 For ay r > 0, 0 < δ < 1, with probability at least 1 δ, ˆV(ĥ, r) V(h, 2σ(, δ/2, d) + r) where σ(, δ, d) = 2 2 2e d l( ) + l 2 d δ Proof If h ˆV(ĥ, r), the err S (h) err S (ĥ) + r (*) Accordig to Theorem 5, with probability at least 1 δ, err S (ĥ) err S(h ) (**) err P (h) err S (h) + σ(, δ/2, d) err S (h ) err P (h ) + σ(, δ/2, d) So with probability at least 1 δ, err P (h) + err S (h ) err S (h) + err P (h ) + 2σ(, δ/2, d) err P (h) + err S (h ) err S (ĥ) + r + err P (h ) + 2σ(, δ/2, d) [applyig (*)] err P (h) + err S (ĥ) err S(ĥ) + r + err P (h ) + 2σ(, δ/2, d) [applyig (**)] err P (h) err P (h ) + 2σ(, δ/2, d) + r err P (h) B(h, 2σ(, δ/2, d) + r) 7

Similarly, we visualize the ituitio below i Figure 2 Usig Theorem 5, we costruct a slightly bigger ball aroud h that icludes the empirical error ball Figure 2: 2D visualizatio of ituitio behid Lemma

r + ) I Lemma 2, we show that whe we set = 2σ(, δ/2, d), with probability at least 1 δ, hypotheses i ˆV(ĥ, r) are also i V(h, r + ) Lemma 3 Assume that H has disagreemet coefficiet θ, ad F is a (β,

8 Similarly, we visualize the ituitio below i Figure 2 Usig Theorem 5, we costruct a slightly bigger ball aroud h that icludes the empirical error ball Figure 2: 2D visualizatio of ituitio behid Lemma 2 Left: The blue regio, which is the itersectio of the gree dotted circle ad H, represets the hypotheses i ˆV(ĥ, r) Right: The itersectio of H ad the red dotted circle represets the hypotheses i V(h, r + ) I Lemma 2, we show that whe we set = 2σ(, δ/2, d), with probability at least 1 δ, hypotheses i ˆV(ĥ, r) are also i V(h, r + ) Lemma 3 Assume that H has disagreemet coefficiet θ, ad F is a (β, B)-Berstei class with respect to D, the for ay r > 0 ad 0 < δ < 1, with probability at least 1 δ, ˆV(ĥ, r) Bθ(2σ(, δ/2, d) + r)β Proof Combie Lemma 1 ad Lemma 2, we get with probability at least 1 δ ˆV(ĥ, r) V(h, 2σ(, δ/2, d) + r) B(h, B(2σ(, δ/2, d) + r) β ) Thus, ˆV(ĥ, r) B(h, B(2σ(, δ/2, d) + r) β ) Bθ(2σ(, δ/2, d) + r) β Theorem 6 Assume that H has disagreemet coefficiet θ ad that F is said to be a (β, B)- Berstei class with respect to D Let (h, g) be selective classifier output by Algorithm 4 the for ay r > 0 ad 0 < δ < 1, with probability at least 1 δ, cover(h, g) 1 θb(4σ(, δ/4, d) + r) β err P (h ) = err P (h) 8

9 Proof By Theorem 5, with probability at least 1 δ/2, err S (h ) err P (h ) + σ(, δ/4, d) err P (ĥ) errs(ĥ) + σ(, δ/4, d) The, with probability at least 1 δ/2, err S (h ) err S (ĥ) + 2σ(, δ/4, d), which implies h ˆV(ĥ, 2σ(, δ/4, d)) So, for ay x X, if g(x) = 1, ĥ(x) = h (x) Therefore, err P (h ) = err P (h) Fially, usig a uio boud, we ca applyig Lemma 3, we get that with probability at least 1 δ, cover(ĥ, g) = 1 G 1 θb(4σ(, δ/4, d))β err P (h ) = err P (h) Oe of the cocer with Algorithm 4 is that it is o so clear how to set β or B We kow that if the excess loss class is o-egative, the it is a (1, B)-Berstei class for ay B However, it is urealistic to assume o-egative excess loss class for arbitrary datasets Aother area may be iterestig for future research is to challege the idea of usig h for compariso i reliable learig uder the oisy settig Give that h still make mistakes, ca we do better tha h whe we decide to make a decisio give a reasoable coverage? Bibliographic otes Besides explicitly marked refereces, the cofidece-rated predictor ad CZ selective classifier are proposed by [2], ad the selective classifier strategy i both realizable settig ad oisy settig are proposed by [4] Refereces [1] Olivier Bousquet, Stéphae Bouchero, ad Gábor Lugosi Itroductio to statistical learig theory I Advaced lectures o machie learig, pages Spriger, 2004 [2] Kamalika Chaudhuri ad Chicheg Zhag Improved algorithms for cofidece-rated predictio with error guaratees 2013 [3] RL Rivest ad R Sloa A formal model of hierarchical cocept-learig If Comput, 114(1):88 114, October 1994 [4] Yair Wieer ad Ra El-Yaiv Agostic selective classificatio I Advaces i eural iformatio processig systems, pages ,

Selective Prediction. Binary classifications. Rong Zhou November 8, 2017

Selective Prediction. Binary classifications. Rong Zhou November 8, 2017 Selective Prediction Binary classifications Rong Zhou November 8, 2017 Table of contents 1. What are selective classifiers? 2. The Realizable Setting 3. The Noisy Setting 1 What are selective classifiers?