Surrogate regret bounds for generalized classification performance metrics
1 Surrogate regret bounds for generalized classification performance metrics
Wojciech Kotłowski, Krzysztof Dembczyński
Poznań University of Technology
PL-SIGML, Częstochowa
2 Motivation
3 Kaggle Higgs Boson Machine Learning Challenge
4 Data set and rules
Classes: signal (h → τ⁺τ⁻) and background.
(# events, # features, % signal weight)
Evaluation: Approximate Median Significance (AMS):
AMS = √( 2( (s + b + 10) log(1 + s/(b + 10)) − s ) ),
where s, b are the total weights of signal/background events classified as signal.
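The AMS formula above translates directly into code. A minimal sketch (the regularization default b_reg = 10 follows the slide; the function name is mine):

```python
import math

def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance, as defined on the slide.

    s, b: total weights of signal/background events classified as signal;
    b_reg = 10 is the regularization constant from the challenge.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Admitting more background at the same signal weight lowers the score:
high_purity = ams(100.0, 200.0)
low_purity = ams(100.0, 400.0)
```

With no selected signal the score is zero, and for fixed s the score decreases in b, matching the intuition that AMS rewards pure signal regions.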
6 How to optimize AMS?
Research Problem: How to optimize a global function of true/false positives/negatives, not decomposable into individual losses over the observations?
Most popular approach: sort the classifier's scores and tune a threshold to maximize AMS: examples scoring below the threshold are classified as negative, those above as positive.
AMS is not used while training, only for tuning the threshold.
Is this approach theoretically justified?
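The sort-and-threshold heuristic is easy to sketch. The code below is illustrative, not any competitor's actual code: it scans every cut point of the sorted scores and keeps the one maximizing AMS (labels +1/−1 denote signal/background; `ams` repeats the formula from the data-set slide):

```python
import math

def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def best_threshold(scores, labels, weights):
    """Classify events with score > threshold as signal; return the
    (AMS, threshold) pair maximizing AMS over all cut points."""
    events = sorted(zip(scores, labels, weights))   # ascending by score
    s = sum(w for _, y, w in events if y == 1)      # with threshold -inf,
    b = sum(w for _, y, w in events if y == -1)     # everything is "signal"
    best_ams, best_theta = ams(s, b), float("-inf")
    for score, y, w in events:
        # Raise the threshold to this score: the event flips to "background".
        if y == 1:
            s -= w
        else:
            b -= w
        if ams(s, b) > best_ams:
            best_ams, best_theta = ams(s, b), score
    return best_ams, best_theta

# Toy example: signal events score high, so the best cut sits between classes.
a, theta = best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1], [1.0] * 4)
```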
9 Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, volume 45 of JMLR W&C Proc., 2015. [Best Paper Award]
Wojciech Kotłowski. Consistent optimization of AMS by logistic loss minimization. In NIPS HEPML Workshop, volume 42 of JMLR W&C Proc.
10 Generalized performance metrics for binary classification
Given a binary classifier h: X → {−1, 1}, define:
Ψ(h) = Ψ( FP(h), FN(h) ),
where:
FP(h) = Pr(h(x) = 1, y = −1), FN(h) = Pr(h(x) = −1, y = 1).

predicted ŷ = h(x):   −1    +1    total
true y = −1:          TN    FP    1 − P
true y = +1:          FN    TP    P

Goal: maximize Ψ(h). We assume Ψ is non-increasing in FP and FN.
11 Linear-fractional performance metric
Definition:
Ψ(FP, FN) = (a₀ + a₁FP + a₂FN) / (b₀ + b₁FP + b₂FN)

Examples:
Accuracy: Acc = 1 − FP − FN
F_β-measure: F_β = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity: J = (P − FN) / (P + FP)
AM measure: AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy: WA = 1 − w₋FP − w₊FN
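These formulas translate directly into code. A small sketch in the slide's notation (population rates FP, FN and class prior P = Pr(y = 1); the function names are mine):

```python
def f_beta(FP: float, FN: float, P: float, beta: float = 1.0) -> float:
    """F_beta in terms of FP, FN and the prior P."""
    b2 = beta * beta
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(FP: float, FN: float, P: float) -> float:
    return (P - FN) / (P + FP)

def am(FP: float, FN: float, P: float) -> float:
    return 1 - FN / (2 * P) - FP / (2 * (1 - P))

# A perfect classifier (FP = FN = 0) scores 1 under all three metrics.
perfect = (f_beta(0, 0, 0.3), jaccard(0, 0, 0.3), am(0, 0, 0.3))
```

Sanity check: with P = 0.3 and FP = FN = 0.1 (so TP = 0.2), precision = recall = 2/3 and hence F₁ = 2/3, which the rate-based formula reproduces.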
13 Convex performance metrics
Definition: Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS₂ score:
AMS₂(TP, FP) = √( 2( (TP + FP) log(1 + TP/FP) − TP ) ).
15 Example: F₁-measure. (Figure: surface of F₁ as a function of FP and FN.)
16 Example: AMS₂ score. (Figure: surface of AMS₂ as a function of FP and FN.)
17 A simple approach to optimization of Ψ(h)
1. On the training data, learn a real-valued f(x) (using a standard classification tool).
2. On the validation data, learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ}), where h_{f,θ}(x) = −1 if f(x) ≤ θ and h_{f,θ}(x) = +1 if f(x) > θ.
21 Our results
Algorithm:
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.
Our results (informally):
Assumptions: the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss), and Ψ is linear-fractional or jointly convex.
Claim: if f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.
23 Ψ-regret and ℓ-regret
Ψ-regret of a classifier h: X → {−1, 1} (measures suboptimality):
Reg_Ψ(h) = Ψ(h*) − Ψ(h), where h* = argmax_h Ψ(h).
Surrogate loss ℓ(y, f(x)) of a real-valued function f: X → R, used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f: Risk_ℓ(f) = E_{(x,y)}[ℓ(y, f(x))].
ℓ-regret of f: Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*), where f* = argmin_f Risk_ℓ(f).
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.
28 Examples of surrogate losses
Logistic loss: ℓ(y, ŷ) = log(1 + e^{−yŷ}).
Its risk minimizer f*(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1 | x), is an invertible function of the conditional probability η(x).
Hinge loss: ℓ(y, ŷ) = (1 − yŷ)₊.
Its risk minimizer f*(x) = sgn(η(x) − 1/2) is non-invertible.
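The invertibility of the logistic link can be checked numerically. A small illustrative sketch (not from the paper): for fixed η, the conditional logistic risk is minimized at f = ψ(η) = log(η/(1 − η)), and the sigmoid inverts the link:

```python
import math

def logistic_loss(y: int, f: float) -> float:
    return math.log1p(math.exp(-y * f))

def cond_risk(eta: float, f: float) -> float:
    """Conditional risk E_{y|x}[l(y, f)] when Pr(y = 1 | x) = eta."""
    return eta * logistic_loss(1, f) + (1 - eta) * logistic_loss(-1, f)

eta = 0.8
f_star = math.log(eta / (1 - eta))        # psi(eta): the link
eta_back = 1 / (1 + math.exp(-f_star))    # the sigmoid recovers eta
# On a grid around f_star, no point beats f_star: the risk is minimized there.
grid = [f_star + 0.1 * k for k in range(-20, 21)]
best_f = min(grid, key=lambda f: cond_risk(eta, f))
```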
30 Examples of surrogate losses. (Figure: 0/1 loss, squared error loss, logistic loss, hinge loss, and exponential loss as functions of yf(x).)
31 Examples of surrogate losses

loss          | f*(η) = ψ(η)          | η(f*) = ψ⁻¹(f*)
squared error | 2η − 1                | (1 + f)/2
logistic      | log(η/(1 − η))        | e^f / (1 + e^f)
exponential   | (1/2) log(η/(1 − η))  | 1 / (1 + e^{−2f})
hinge         | sgn(η − 1/2)          | doesn't exist
32 Proper composite losses
ℓ(y, f) is proper composite if there exists a strictly increasing link function ψ such that:
f*(x) = ψ(η(x)), where η(x) = Pr(y = 1 | x).
Minimizing proper composite losses implies probability estimation.
33 Strongly proper composite losses [Agarwal, 2014]
ℓ(y, f) is λ-strongly proper composite if it is proper composite and for any f, x, and conditional distribution of y given x:
E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f*(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².
Technical condition.
34 Strongly proper composite losses: examples

loss          | f*(η) = ψ(η)          | η(f*) = ψ⁻¹(f*)   | λ
squared error | 2η − 1                | (1 + f)/2         | 8
logistic      | log(η/(1 − η))        | e^f / (1 + e^f)   | 4
exponential   | (1/2) log(η/(1 − η))  | 1 / (1 + e^{−2f}) | 4
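For the logistic loss, the strong-properness inequality can be verified numerically on a grid. This is a sanity check, not a proof; λ = 4 works for logistic loss because the conditional logistic regret equals a binary KL divergence, which Pinsker's inequality bounds below by 2(η − ψ⁻¹(f))²:

```python
import math

def cond_regret(eta: float, f: float) -> float:
    """Conditional logistic regret: E_{y|x}[l(y, f)] - E_{y|x}[l(y, f*)],
    where f* = log(eta / (1 - eta)) is the pointwise risk minimizer."""
    def risk(g: float) -> float:
        return eta * math.log1p(math.exp(-g)) + (1 - eta) * math.log1p(math.exp(g))
    return risk(f) - risk(math.log(eta / (1 - eta)))

lam = 4.0  # strong-properness constant for the logistic loss
holds = all(
    cond_regret(eta, f) >= (lam / 2) * (eta - 1 / (1 + math.exp(-f))) ** 2 - 1e-12
    for eta in (0.1, 0.3, 0.5, 0.7, 0.9)
    for f in (-3 + 0.25 * k for k in range(25))
)
```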
35 Main result
Theorem (linear-fractional metrics). If:
Ψ(FP, FN) is linear-fractional, non-increasing in FP and FN,
ℓ is λ-strongly proper composite,
then there exists a threshold θ* such that, for any real-valued function f,
Reg_Ψ(h_{f,θ*}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ).

metric             | C_Ψ
F_β-measure        | (1 + β²) / (β²P)
Jaccard similarity | (J + 1) / P
AM measure         | 1 / (2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS₂).
38 Explanation of the theorem
The maximizer of Ψ is h*(x) = sgn(η(x) − η*) for some η*.
The minimizer of ℓ is f*(x) = ψ(η(x)) for an invertible ψ.
Thresholding f*(x) at θ* = ψ(η*) is equivalent to thresholding η(x) at η* (for logistic loss, η(x) = 1/(1 + e^{−f*(x)})).
The gradients of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f*.
42 The optimal threshold θ*
Ψ = classification accuracy ⟹ θ* = 0 (η* = 1/2).
Ψ = weighted accuracy ⟹ η* = w₋ / (w₊ + w₋).
More complex Ψ ⟹ θ* unknown (depends on Ψ(h*)) ⟹ estimate θ* from validation data.
46 Tuning the threshold
Corollary. Given a real-valued function f and a validation sample of size m, let:
θ̂ = argmax_θ Ψ̂(h_{f,θ}) (Ψ̂ = empirical estimate of Ψ).
Then, under the same assumptions and notation:
Reg_Ψ(h_{f,θ̂}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ) + O(1/√m).
Learning a standard binary classifier and tuning the threshold afterwards thus recovers the maximizer of Ψ in the limit.
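The whole two-step procedure fits in a short script. The sketch below is illustrative only, under assumptions of my own choosing: synthetic one-dimensional data, plain gradient descent for the logistic-loss step, and an exhaustive sweep maximizing empirical F₁ on validation data. None of the names come from the paper:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def sample(n):
    """Synthetic data: x ~ N(0, 1), Pr(y = 1 | x) = sigmoid(2x)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [1 if random.random() < sigmoid(2 * x) else -1 for x in xs]
    return xs, ys

def fit_logistic(xs, ys, steps=1000, lr=0.5):
    """Step 1: learn f(x) = w*x + c by minimizing the logistic loss."""
    w = c = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gc = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(-y * (w * x + c))  # derivative factor of log(1 + e^{-yf})
            gw -= y * p * x / n
            gc -= y * p / n
        w, c = w - lr * gw, c - lr * gc
    return w, c

def tune_threshold(scores, ys):
    """Step 2: sweep all cut points, maximizing empirical F1 on validation."""
    best_t, best_f1 = float("-inf"), -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, ys) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, ys) if s > t and y == -1)
        fn = sum(1 for s, y in zip(scores, ys) if s <= t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

xs_tr, ys_tr = sample(300)
xs_va, ys_va = sample(300)
w, c = fit_logistic(xs_tr, ys_tr)
theta, val_f1 = tune_threshold([w * x + c for x in xs_va], ys_va)
```

With enough data, the learned slope approaches the true value 2 and the tuned threshold approaches ψ(η*) for the F-measure's optimal η*, as the corollary predicts.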
48 Multilabel classification
A vector of m labels y = (y₁, ..., y_m) for each x.
Multilabel classifier h(x) = (h₁(x), ..., h_m(x)).
False positive/negative rates for each label:
FP_i(h_i) = Pr(h_i = 1, y_i = −1), FN_i(h_i) = Pr(h_i = −1, y_i = 1).
How can a binary classification metric Ψ be used in the multilabel setting? We extend our bounds to cover micro- and macro-averaging.
50 Micro- and macro-averaging
Macro-averaging (average outside Ψ):
Ψ_macro(h) = (1/m) ∑_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).
Our bound suggests that a separate threshold needs to be tuned for each label.
Micro-averaging (average inside Ψ):
Ψ_micro(h) = Ψ( (1/m) ∑_{i=1}^m FP_i(h_i), (1/m) ∑_{i=1}^m FN_i(h_i) ).
Our bound suggests that all labels share a single threshold.
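The two averaging schemes differ only in where the mean is taken. A small sketch using F₁ in the slide's rate notation (the per-label rates here are made up for illustration):

```python
def f1_rates(FP: float, FN: float, P: float) -> float:
    """F1 in terms of per-label rates FP, FN and prior P (slide notation)."""
    return 2 * (P - FN) / (2 * P - FN + FP)

def macro_f1(rates) -> float:
    """Average OUTSIDE the metric: mean of per-label F1 scores."""
    return sum(f1_rates(*r) for r in rates) / len(rates)

def micro_f1(rates) -> float:
    """Average INSIDE the metric: pool the rates first, then apply F1 once."""
    m = len(rates)
    FP = sum(r[0] for r in rates) / m
    FN = sum(r[1] for r in rates) / m
    P = sum(r[2] for r in rates) / m
    return f1_rates(FP, FN, P)

# A frequent label and a rare one: macro weights the two labels equally,
# while micro is dominated by the pooled rates.
rates = [(0.05, 0.10, 0.30), (0.01, 0.02, 0.05)]
macro, micro = macro_f1(rates), micro_f1(rates)
```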
52 Experiments
53 Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Procedure: minimize ℓ (logistic or hinge) on the training data, then tune the threshold θ̂ on the validation data by optimizing the metric.
54 Synthetic experiment I
X = {1, 2, ..., 25} with Pr(x) uniform. For each x ∈ X, η(x) ~ Unif[0, 1].
(Figure, left: logistic loss surrogate, which converges as expected; logistic regret, F-measure regret, and AM regret vs. the number of training examples.)
(Figure, right: hinge loss surrogate, which does not converge; hinge regret, F-measure regret, and AM regret vs. the number of training examples.)
55 Synthetic experiment II
X = R², x ~ N(0, I). Logistic model: η(x) = (1 + exp(−a₀ − aᵀx))⁻¹.
(Figure, left: convergence of the F-measure; regret of the F-measure and of the surrogate losses for LR and SVM vs. the number of training examples.)
(Figure, right: convergence of the AM measure; regret of the AM measure and of the surrogate losses for LR and SVM vs. the number of training examples.)
56 Benchmark data experiment (a bit of a surprise)

dataset        | #examples | #features
covtype.binary | 581,012   | 54
gisette        | 7,000     | 5,000

(Figure: F-measure and AM measure for LR and SVM vs. the number of training examples, on covtype.binary and gisette.)
Logistic loss behaves as expected, but hinge loss also performs surprisingly well.
57 Multilabel classification
Data sets: scene, yeast, mediamill (differing in the number of labels, training/test examples, and features).
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation of algorithms tuned for micro-averaging in terms of macro-averaged metrics, and vice versa.
58 Multilabel classification: F-measure
(Figure: rows are the scene, yeast, and mediamill data sets; columns are macro-F and micro-F. Each panel plots LR and SVM with thresholds tuned for macro-F and for micro-F vs. the number of training examples.)
59 Multilabel classification: AM measure
(Figure: rows are the scene, yeast, and mediamill data sets; columns are macro-AM and micro-AM. Each panel plots LR and SVM with thresholds tuned for macro-AM and for micro-AM vs. the number of training examples.)
60 Open problem
Ψ(h*) − Ψ(h) ≤ const · √( Risk_ℓ(f) − Risk_ℓ(f*) ),
where f* is the minimizer over all functions.
We often optimize within some class F (e.g., linear functions). If f* ∉ F, the right-hand side does not converge to 0.
This can be beneficial for non-proper losses (e.g., hinge loss).
Most often f* ∉ F, and a theory for this case is needed.
62 Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex metrics optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer. Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if the risk minimizer lies outside the class of functions considered?
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationML4NLP Multiclass Classification
ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we
More informationMachine Learning. Ludovic Samper. September 1st, Antidot. Ludovic Samper (Antidot) Machine Learning September 1st, / 77
Machine Learning Ludovic Samper Antidot September 1st, 2015 Ludovic Samper (Antidot) Machine Learning September 1st, 2015 1 / 77 Antidot Software vendor since 1999 Paris, Lyon, Aix-en-Provence 45 employees
More informationLecture 10 February 23
EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationListwise Approach to Learning to Rank Theory and Algorithm
Listwise Approach to Learning to Rank Theory and Algorithm Fen Xia *, Tie-Yan Liu Jue Wang, Wensheng Zhang and Hang Li Microsoft Research Asia Chinese Academy of Sciences document s Learning to Rank for
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationSupport Vector Machines, Kernel SVM
Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationRandomized Smoothing for Stochastic Optimization
Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and
More informationABC-LogitBoost for Multi-Class Classification
Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12
More informationRegret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss
Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1,3, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1
More informationSubgradient Descent. David S. Rosenberg. New York University. February 7, 2018
Subgradient Descent David S. Rosenberg New York University February 7, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 February 7, 2018 1 / 43 Contents 1 Motivation and Review:
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationarxiv: v1 [stat.ml] 8 Feb 2014
F1-Optimal Thresholding in the Multi-Label Setting arxiv:1402.1892v1 [stat.ml] 8 Feb 2014 Zachary Chase Lipton, Charles Elkan, Balakrishnan (Murali) Naryanaswamy {zlipton,elkan,muralib}@cs.ucsd.edu University
More information1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 October 31, 2016 Due: A. November 11, 2016; B. November 22, 2016 A. Boosting 1. Implement
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More informationMethods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }
More informationClassification with Reject Option
Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012 Outline. Introduction.. Classification with reject option. Spirit of the papers BW2008.. Infinite
More informationOn the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost Hamed Masnadi-Shirazi Statistical Visual Computing Laboratory, University of California, San Diego La
More informationPerformance Evaluation
Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,
More informationLinear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining
Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes November 9, 2012 1 / 1 Nearest centroid rule Suppose we break down our data matrix as by the labels yielding (X
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationOn Equivalence Relationships Between Classification and Ranking Algorithms
On Equivalence Relationships Between Classification and Ranking Algorithms On Equivalence Relationships Between Classification and Ranking Algorithms Şeyda Ertekin Cynthia Rudin MIT Sloan School of Management
More informationMachine Learning and Data Mining. Linear classification. Kalev Kask
Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More information