Pointwise Exact Bootstrap Distributions of Cost Curves


1 Pointwise Exact Bootstrap Distributions of Cost Curves
Charles Dugas and David Gadoury, University of Montréal
25th ICML, Helsinki, July 2008
Dugas, Gadoury (U Montréal), Cost curves, July 8, 2008

2 Outline
1. Introduction
2. ROC Curves
3. Cost Curves
4. Out-of-sample performance measure
5. Derivations of confidence intervals
6. Numerical results
7. Discussion

3 Introduction
Goal: identify the presence of a certain condition (e.g. fraud, malignant tumors, defective parts), given a set of features, i.e. binary classification.
Model: outputs a continuous score s for each example of a set. A higher s means a higher chance that the condition is present.
Out-of-sample (OOS) performance:
  scalars: error rate (accuracy), AUC, etc.
  curves: ROC curves, cost curves
Confidence intervals:
  pointwise, not bands
  for one or two models

4 ROC Curves
Threshold t: an instance is labelled positive ($s \ge t$) or negative ($s < t$).

                      Truth: Positive    Truth: Negative
Decision: Positive    True positives     False positives
Decision: Negative    False negatives    True negatives

True positive rate (tpr) = #True positives / #Positives
False positive rate (fpr) = #False positives / #Negatives
Scalar measures:
  aggregate performance over all thresholds
  arbitrary weighting of the two error types (FN and FP)
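As a small illustration of the definitions on this slide, the sketch below computes tpr and fpr at a single threshold. The function name `rates_at_threshold` and the toy scores and labels are hypothetical and are not taken from the slides.

```python
def rates_at_threshold(scores, labels, t):
    """Return (tpr, fpr) when instances with score >= t are labelled positive.

    labels use 1 for the positive class and 0 for the negative class.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


# Toy data (made up for illustration): three positives, three negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
tpr, fpr = rates_at_threshold(scores, labels, t=0.5)
```

Sweeping t over all observed scores and collecting the resulting (fpr, tpr) pairs gives the empirical ROC curve.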

5 Illustration of ROC Curves
ROC curve: plot of the true positive rate (tpr) against the false positive rate (fpr) for different thresholds.
[Figure: score densities and the corresponding ROC curve (TP rate against FP rate).]

6 ROC Curves
ROC pros and cons:
  the curve is independent of prior class probabilities
  the curve is independent of cost values
  it fails to address the real issue: expected cost (measure, view, minimize, compare, etc.)
See: ICML '04 tutorial [Flach, 2004], intro paper [Fawcett, 2006]

7 Cost Curves
[Drummond and Holte, 2000], [Drummond and Holte, 2006]
Operating conditions:
  misclassification costs $(c_{-/+}, c_{+/-})$: $c_{-/+}$ for a false negative, $c_{+/-}$ for a false positive
  prior probabilities $(p_+, p_-)$
Expected cost $= p_- \cdot \mathrm{fpr} \cdot c_{+/-} + p_+ (1 - \mathrm{tpr})\, c_{-/+}$
Operating point: $w = \dfrac{p_+ c_{-/+}}{p_+ c_{-/+} + p_- c_{+/-}} \in [0, 1]$
Normalized cost: $(1 - w)\,\mathrm{fpr} + w\,(1 - \mathrm{tpr})$
Given $w$, we choose the pair (fpr, tpr) from the ROC curve that minimizes the normalized cost:
$C(w) = \min_{(\mathrm{fpr}, \mathrm{tpr}) \in \mathrm{ROC}} \; (1 - w)\,\mathrm{fpr} + w\,(1 - \mathrm{tpr})$
Cost curve: plot of $C(w)$ against $w \in [0, 1]$
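The construction on this slide can be sketched as follows: build the empirical ROC points, then take, for each operating point w, the minimum normalized cost over those points. The function names, the argument names `c_fn` and `c_fp` (for the false-negative and false-positive costs) and the toy data are assumptions made for illustration only.

```python
def roc_points(scores, labels):
    """(fpr, tpr) pairs from thresholding at each distinct score,
    plus the point (0, 0) for 'label everything negative'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def operating_point(p_pos, c_fn, c_fp):
    """w = p+ c(-/+) / (p+ c(-/+) + p- c(+/-))."""
    return p_pos * c_fn / (p_pos * c_fn + (1 - p_pos) * c_fp)


def cost_curve(points, ws):
    """C(w): minimum normalized cost over the ROC points, for each w."""
    return [min((1 - w) * fpr + w * (1 - tpr) for fpr, tpr in points)
            for w in ws]


# Toy data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
ws = [i / 100 for i in range(101)]
curve = cost_curve(roc_points(scores, labels), ws)
```

Because the minimum is taken over the same data that defines the ROC points, this sketch is an in-sample curve; it does not use a separate validation set for threshold selection.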

8 From ROC to Cost
[Figure: score densities, the ROC curve (true positive rate against false positive rate) and the resulting cost curve (cost against operating point w).]


9 Out-of-sample performance measure
Drawing a cost curve involves threshold optimization.
  This must be conducted on a validation set disjoint from the test set.
How can a performance distribution be obtained from a single test set?
  Empirical bootstrap: take samples of the test set, with replacement.
  Exact bootstrap: analytic derivation for an infinite number of samples.
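For contrast with the exact approach, the empirical bootstrap can be sketched as below. This is a minimal sketch under stated assumptions: resampling is done separately within each class, the threshold t and operating point w are fixed in advance, and the function name, the number of resamples and the toy scores are all hypothetical.

```python
import random


def empirical_bootstrap_costs(pos_scores, neg_scores, t, w, n_boot=2000, seed=0):
    """Resample each class with replacement and return the normalized
    cost w * (1 - tpr) + (1 - w) * fpr of every bootstrap sample."""
    rng = random.Random(seed)
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    costs = []
    for _ in range(n_boot):
        pos = rng.choices(pos_scores, k=n_pos)   # sampling with replacement
        neg = rng.choices(neg_scores, k=n_neg)
        tpr = sum(s >= t for s in pos) / n_pos
        fpr = sum(s >= t for s in neg) / n_neg
        costs.append(w * (1 - tpr) + (1 - w) * fpr)
    return costs


# Toy scores (made up for illustration).
pos_scores = [0.9, 0.8, 0.4, 0.6, 0.2]
neg_scores = [0.7, 0.3, 0.1, 0.5, 0.05]
costs = empirical_bootstrap_costs(pos_scores, neg_scores, t=0.45, w=0.5)
mean_cost = sum(costs) / len(costs)
```

The empirical mean only approximates the value that the exact bootstrap gives in closed form (0.4 for this toy data), and the approximation improves as `n_boot` grows.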

10 C.I. for a single classifier's cost curve
$n$: test set size; $n^+$ ($n^-$): number of positive (negative) instances in the test set.
With prior probabilities $p_+$, $p_-$ (and costs) fixed, $n^+$ and $n^-$ are constant for all samples: stratified sampling.
$n^+_t$ ($n^-_t$): number of positive (negative) instances in the test set with $s \ge t$, where $t = t(w)$.
$N^+_t$ ($N^-_t$): r.v. for the number of positive (negative) instances, in a given sample, with $s \ge t$.
$TP_t = N^+_t / n^+$, $FP_t = N^-_t / n^-$
$N^+_t \sim \mathrm{Bin}(n^+_t / n^+,\, n^+)$, $N^-_t \sim \mathrm{Bin}(n^-_t / n^-,\, n^-)$
$C_t = w\,(1 - TP_t) + (1 - w)\,FP_t$
$E[C_t] = w\,(1 - n^+_t / n^+) + (1 - w)\, n^-_t / n^-$
$\mathrm{Var}[C_t] = w^2\, \dfrac{n^+_t}{n^+} \left(1 - \dfrac{n^+_t}{n^+}\right) \Big/ n^+ \;+\; (1 - w)^2\, \dfrac{n^-_t}{n^-} \left(1 - \dfrac{n^-_t}{n^-}\right) \Big/ n^-$
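The stratified-sampling quantities above can be checked numerically. The sketch below computes the mean and variance from the binomial parameters and, as one possible way to obtain the full pointwise distribution, enumerates the two independent binomials. The function names are hypothetical, the enumeration is an illustrative choice that the slides do not spell out, and the variance uses the usual binomial-proportion form p(1 - p)/n.

```python
from math import comb


def exact_moments(n_pos, n_neg, n_pos_t, n_neg_t, w):
    """Mean and variance of C_t = w (1 - TP_t) + (1 - w) FP_t."""
    p, q = n_pos_t / n_pos, n_neg_t / n_neg
    mean = w * (1 - p) + (1 - w) * q
    var = w**2 * p * (1 - p) / n_pos + (1 - w)**2 * q * (1 - q) / n_neg
    return mean, var


def binom_pmf(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def exact_distribution(n_pos, n_neg, n_pos_t, n_neg_t, w):
    """Sorted (cost, probability) pairs for C_t, by enumerating the
    independent counts N+_t and N-_t."""
    dist = {}
    for i, pi in enumerate(binom_pmf(n_pos, n_pos_t / n_pos)):
        for j, pj in enumerate(binom_pmf(n_neg, n_neg_t / n_neg)):
            cost = round(w * (1 - i / n_pos) + (1 - w) * j / n_neg, 12)
            dist[cost] = dist.get(cost, 0.0) + pi * pj
    return sorted(dist.items())


# Toy counts (made up): 5 positives, 3 of them above t; 5 negatives, 2 above t.
mean, var = exact_moments(5, 5, 3, 2, w=0.5)
dist = exact_distribution(5, 5, 3, 2, w=0.5)
dist_mean = sum(c * pr for c, pr in dist)
dist_var = sum((c - dist_mean) ** 2 * pr for c, pr in dist)
```

Quantiles of `dist` could then serve as pointwise interval endpoints; the enumeration costs O(n+ n-) per threshold, so it is only meant for small examples.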

11 C.I. for difference between two cost curves
The scores of the two classifiers are dependent.
Thresholds may have different meanings: $t_1 = t_1(w)$, $t_2 = t_2(w)$.
Examples that both classifiers label identically (e.g. $s_1 \ge t_1$ and $s_2 \ge t_2$) have no effect on the cost difference.
$n^+_{t_1}$: number of positive instances with $s_1 \ge t_1$, $s_2 < t_2$
$n^+_{t_2}$, $n^-_{t_1}$, $n^-_{t_2}$: defined similarly
$N^+_{t_1}$, $N^+_{t_2}$, $N^-_{t_1}$, $N^-_{t_2}$: corresponding r.v.
$(N^+_{t_1}, N^+_{t_2}) \sim \mathrm{Mult}(p^+_{t_1}, p^+_{t_2}, n^+)$, with $p^+_{t_1} = n^+_{t_1} / n^+$, $p^+_{t_2} = n^+_{t_2} / n^+$
$\Delta C_{t_1,t_2} = C_{t_2} - C_{t_1} = w\,(TP_{t_1} - TP_{t_2}) + (1 - w)\,(FP_{t_2} - FP_{t_1})$
$E[\Delta C_{t_1,t_2}] = w\,(p^+_{t_1} - p^+_{t_2}) + (1 - w)\,(p^-_{t_2} - p^-_{t_1})$
$\mathrm{Var}[\Delta C_{t_1,t_2}] = w^2 \left[p^+_{t_1} + p^+_{t_2} - (p^+_{t_1} - p^+_{t_2})^2\right] / n^+ \;+\; (1 - w)^2 \left[p^-_{t_1} + p^-_{t_2} - (p^-_{t_1} - p^-_{t_2})^2\right] / n^-$
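A minimal sketch of the two-classifier case: count the instances on which the two classifiers disagree, then plug the counts into the mean and variance of the cost difference. The function and argument names and the example counts are hypothetical.

```python
def discordant_counts(s1, s2, t1, t2):
    """For instances of ONE class, count those accepted only by
    classifier 1 (s1 >= t1, s2 < t2) and only by classifier 2."""
    only_1 = sum(1 for a, b in zip(s1, s2) if a >= t1 and b < t2)
    only_2 = sum(1 for a, b in zip(s1, s2) if b >= t2 and a < t1)
    return only_1, only_2


def diff_moments(n_pos, n_neg, pos_1, pos_2, neg_1, neg_2, w):
    """Mean and variance of C_t2 - C_t1 under stratified sampling.

    pos_1, pos_2: discordant counts among positives;
    neg_1, neg_2: discordant counts among negatives.
    """
    p1, p2 = pos_1 / n_pos, pos_2 / n_pos
    q1, q2 = neg_1 / n_neg, neg_2 / n_neg
    mean = w * (p1 - p2) + (1 - w) * (q2 - q1)
    var = (w**2 * (p1 + p2 - (p1 - p2) ** 2) / n_pos
           + (1 - w) ** 2 * (q1 + q2 - (q1 - q2) ** 2) / n_neg)
    return mean, var


# Toy counts (made up): 10 positives and 10 negatives.
mean, var = diff_moments(10, 10, pos_1=3, pos_2=1, neg_1=2, neg_2=1, w=0.5)
counts = discordant_counts([0.9, 0.2, 0.6], [0.8, 0.7, 0.1], t1=0.5, t2=0.5)
```

Only the discordant instances enter the formulas, which is why the dependence between the two score vectors is handled without modelling it explicitly.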

12 Stratified vs Full sampling
Stratified sampling: draw samples independently from the two classes.
  Gives the cost distribution for a fixed operating point.
Full sampling: draw samples from the whole test set.
  Gives the cost distribution for fixed costs but with a binomial distribution of class proportions.
Full sampling has larger variance.

13 C.I. for a single classifier's cost curve (full sampling)
$N^+$ ($N^-$): number of positive (negative) instances, now r.v.
$c_{\max} = \max(c_{-/+}, c_{+/-})$
$C_t = \dfrac{N^+ c_{-/+} (1 - TP_t) + N^- c_{+/-} FP_t}{n\, c_{\max}}$
$E[C_t] = E_{N^+}\{E[C_t \mid N^+]\} = \dfrac{c_{-/+} (n^+ - n^+_t) + c_{+/-}\, n^-_t}{n\, c_{\max}}$
$V[C_t] = V_{N^+}\{E[C_t \mid N^+]\} + E_{N^+}\{V[C_t \mid N^+]\} = \dfrac{c_{-/+}^2\, \alpha^+_t + c_{+/-}^2\, \alpha^-_t + \delta^2_t}{(n\, c_{\max})^2}$
$\alpha^+_t = n^+_t - \dfrac{(n^+_t)^2}{n^+}$, $\alpha^-_t = n^-_t - \dfrac{(n^-_t)^2}{n^-}$
$\delta^2_t = \left( c_{-/+}\, \dfrac{n^+ - n^+_t}{n^+} - c_{+/-}\, \dfrac{n^-_t}{n^-} \right)^2 \dfrac{n^+ n^-}{n}$
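The full-sampling variance can be cross-checked against a direct multinomial computation: when n instances are drawn with replacement from the whole test set, the false-negative and false-positive counts are two cells of a multinomial. The sketch below evaluates both forms; the function names, the argument names `c_fn` and `c_fp`, and the toy counts are hypothetical.

```python
def full_sampling_moments(n_pos, n_neg, n_pos_t, n_neg_t, c_fn, c_fp):
    """Mean and variance of C_t using the alpha / delta decomposition."""
    n = n_pos + n_neg
    scale = n * max(c_fn, c_fp)
    mean = (c_fn * (n_pos - n_pos_t) + c_fp * n_neg_t) / scale
    alpha_pos = n_pos_t - n_pos_t**2 / n_pos
    alpha_neg = n_neg_t - n_neg_t**2 / n_neg
    delta_sq = ((c_fn * (n_pos - n_pos_t) / n_pos - c_fp * n_neg_t / n_neg) ** 2
                * n_pos * n_neg / n)
    var = (c_fn**2 * alpha_pos + c_fp**2 * alpha_neg + delta_sq) / scale**2
    return mean, var


def multinomial_variance(n_pos, n_neg, n_pos_t, n_neg_t, c_fn, c_fp):
    """Same variance, from Var and Cov of multinomial cell counts."""
    n = n_pos + n_neg
    scale = n * max(c_fn, c_fp)
    m_fn = n_pos - n_pos_t   # false negatives in the test set
    m_fp = n_neg_t           # false positives in the test set
    var_counts = (c_fn**2 * m_fn * (1 - m_fn / n)
                  + c_fp**2 * m_fp * (1 - m_fp / n)
                  - 2 * c_fn * c_fp * m_fn * m_fp / n)
    return var_counts / scale**2


# Toy counts (made up): 6 positives (4 above t), 4 negatives (1 above t).
mean, var = full_sampling_moments(6, 4, 4, 1, c_fn=2.0, c_fp=1.0)
var_check = multinomial_variance(6, 4, 4, 1, c_fn=2.0, c_fp=1.0)
```

Agreement between the two values is a quick sanity check that the conditional-variance decomposition on the slide has been transcribed consistently.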

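14 C.I. for difference between two cost curves (full sampling)
$\Delta C_{t_1,t_2} = \dfrac{c_{-/+} (N^+_{t_1} - N^+_{t_2}) + c_{+/-} (N^-_{t_2} - N^-_{t_1})}{n\, c_{\max}}$
$E[\Delta C_{t_1,t_2}] = E_{N^+}\{E[\Delta C_{t_1,t_2} \mid N^+]\} = \dfrac{c_{-/+} (n^+_{t_1} - n^+_{t_2}) + c_{+/-} (n^-_{t_2} - n^-_{t_1})}{n\, c_{\max}}$
$V[\Delta C_{t_1,t_2}] = V_{N^+}\{E[\Delta C_{t_1,t_2} \mid N^+]\} + E_{N^+}\{V[\Delta C_{t_1,t_2} \mid N^+]\} = \dfrac{c_{-/+}^2\, \alpha^+_{t_1,t_2} + c_{+/-}^2\, \alpha^-_{t_1,t_2} + \delta^2_{t_1,t_2}}{(n\, c_{\max})^2}$
$\alpha^+_{t_1,t_2} = n^+_{t_1} + n^+_{t_2} - \dfrac{(n^+_{t_1} - n^+_{t_2})^2}{n^+}$, $\alpha^-_{t_1,t_2} = n^-_{t_1} + n^-_{t_2} - \dfrac{(n^-_{t_1} - n^-_{t_2})^2}{n^-}$
$\delta^2_{t_1,t_2} = \left( c_{-/+}\, \dfrac{n^+_{t_1} - n^+_{t_2}}{n^+} - c_{+/-}\, \dfrac{n^-_{t_2} - n^-_{t_1}}{n^-} \right)^2 \dfrac{n^+ n^-}{n}$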

15 Simulations (one curve)
Scores of positive instances: $N(\mu = 3, \sigma = 3)$
Scores of negative instances: $N(\mu = -3, \sigma = 3)$
Thresholds set to the cost-minimizing values according to the distributions
Samples of size 25, 250, 2500 and [...] drawn to compute pointwise exact bootstrap confidence intervals (p.e.b.c.i.); [...] simulations
Coverage = proportion of simulations with the true curve included in the C.I.
$\alpha = 10\%$, i.e. 90% C.I.
$w \in \{0.01, 0.02, \ldots, 0.99\}$
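The "true" cost curve for this simulation setting has a closed form, sketched below. It assumes that the negative-class mean is -3 (the class means being symmetric about zero is an assumption here) and that both classes share the same standard deviation. Under those assumptions the cost-minimizing threshold solves $w f_+(t) = (1 - w) f_-(t)$, which gives $t(w) = \frac{\sigma^2}{2\mu} \ln\frac{1 - w}{w}$. The function names are hypothetical.

```python
import math


def norm_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def optimal_threshold(w, mu, sigma):
    """Cost-minimizing threshold, ASSUMING positives ~ N(mu, sigma)
    and negatives ~ N(-mu, sigma)."""
    return sigma**2 / (2.0 * mu) * math.log((1.0 - w) / w)


def true_cost(w, mu=3.0, sigma=3.0):
    """Normalized cost (1 - w) fpr + w (1 - tpr) at the optimal threshold."""
    t = optimal_threshold(w, mu, sigma)
    fpr = 1.0 - norm_cdf(t, -mu, sigma)   # negatives scored at or above t
    fnr = norm_cdf(t, mu, sigma)          # positives scored below t
    return (1.0 - w) * fpr + w * fnr


ws = [i / 100 for i in range(1, 100)]     # w in {0.01, ..., 0.99}
curve = [true_cost(w) for w in ws]
```

Coverage would then be estimated by checking, for each simulated sample and each w, whether the interval contains `true_cost(w)`.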

16 Simulations (one curve - stratified sampling)
[Figure: coverage against operating conditions, one panel per sample size (25, 250, 2500 and [...]).]

17 UCI experiments (one curve)
Datasets, each split into train, validation and test sets (percentage of positives in parentheses):
  Abalone (50%)
  Covertype (57%)
  Credit (german) (69%)
  Telescope (magic) (65%)
Logistic regression models
Entire test set used to compute the true cost curve
Samples of size 25, 250, 2500 and [...] drawn to compute pointwise exact bootstrap confidence intervals (p.e.b.c.i.); [...] simulations
Coverage = proportion of simulations with the true curve included in the C.I.
$\alpha = 10\%$, i.e. 90% C.I.
$w \in \{0.01, 0.02, \ldots, 0.99\}$

18 UCI experiments (one curve - stratified sampling)
[Figure: coverage against operating point (w), one panel per dataset: Abalone, Covertype, Credit, Telescope.]


20 Simulations (two curves)
Scores of positive instances, 1st model: $N(\mu = \theta, \sigma = 3)$
Scores of positive instances, 2nd model: $N(\mu = \theta + \delta, \sigma = 3)$
Scores of negative instances: $N(\mu = -\theta, \sigma = 3)$
Spread: $\theta = 1.0, 3.0$
Shift: $\delta =$ [...], 2.0, 4.0
Score correlation: $\rho = 0.3, 0.6, 0.9$
Thresholds set to the cost-minimizing values according to the distributions
Sample size: [...]; [...] simulations
$\alpha = 10\%$, i.e. 90% C.I.
$w \in \{0.01, 0.02, \ldots, 0.99\}$

21 Simulations (two curves - stratified sampling)
[Figure: coverage against operating conditions; panels by spread (1.0, 3.0) and by shift.]

22 Simulations (two curves - full sampling)
[Figure: coverage against operating conditions; panels by spread (1.0, 3.0) and by shift.]

23 Discussion
Cost curves are an excellent visualization tool for the true target: expected cost.
We provided means to compute confidence intervals of cost curves for:
  stratified or full sampling
  one or two curves
Fast: O(n log n) (once sorted, everything is linear).
Empirical method: cannot extrapolate.
Solutions against breaks: kernels, tail distribution estimation.
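The O(n log n) remark can be illustrated with a short sketch: one sort of the test scores, then running counts give $n^+_t$ and $n^-_t$ for every candidate threshold in a single linear pass. The code below uses the stratified-sampling moments with the binomial-proportion variance p(1 - p)/n and assumes distinct scores; the function name and toy data are hypothetical.

```python
from itertools import accumulate


def pointwise_moments(scores, labels, w):
    """For each threshold t equal to an observed score, return
    (t, mean, variance) of the normalized cost under stratified sampling.

    Assumes distinct scores; labels are 1 (positive) or 0 (negative).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Running counts of positives / negatives with score >= current threshold.
    cum_pos = list(accumulate(labels[i] for i in order))
    cum_neg = list(accumulate(1 - labels[i] for i in order))
    out = []
    for k, i in enumerate(order):
        p, q = cum_pos[k] / n_pos, cum_neg[k] / n_neg
        mean = w * (1 - p) + (1 - w) * q
        var = w**2 * p * (1 - p) / n_pos + (1 - w) ** 2 * q * (1 - q) / n_neg
        out.append((scores[i], mean, var))
    return out


# Toy data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
moments = pointwise_moments(scores, labels, w=0.5)
```

The sort dominates the cost; everything after it is a single pass over the sorted instances.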

24 References
Drummond, C. and Holte, R. (2000). Explicitly representing expected cost: an alternative to ROC representation. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM.
Drummond, C. and Holte, R. (2006). Cost curves: an improved method for visualizing classifier performance. Machine Learning, 65(1).
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8).
Flach, P. (2004). The many faces of ROC analysis in machine learning.
