Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36
Motivation 2 / 36
Kaggle Higgs Boson Machine Learning Challenge 3 / 36
Data set and rules
Classes: signal (h → τ⁺τ⁻) and background.
  # events: 250 000 | # features: 30 | % signal weight: 0.17
Evaluation: Approximate Median Significance (AMS):
  AMS = √( 2( (s + b + 10) log(1 + s/(b + 10)) − s ) )
where s, b are the total weights of signal/background events classified as signal. 4 / 36
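The AMS formula above translates directly to code; a minimal sketch (the function name and the default regularization constant b_reg = 10 follow the challenge definition, everything else is illustrative):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance.

    s, b: total weight of signal / background events classified as signal;
    b_reg = 10 is the regularization constant used in the challenge.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))
```

Note that AMS grows with the captured signal weight s and shrinks as background weight b leaks into the positive class, which is why it rewards a well-chosen decision threshold.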
How to optimize AMS?
Research Problem: how to optimize a global function of true/false positives/negatives that does not decompose into individual losses over the observations?
Most popular approach: sort the classifier's scores and pick the threshold that maximizes AMS; examples scoring below the threshold are classified as negative, those above as positive.
AMS is not used during training, only for tuning the threshold.
Is this approach theoretically justified? 5 / 36
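The popular approach above can be sketched in a few lines: sort by score, sweep the cut point, and keep the best one. All names are illustrative; `metric(s, b)` is any function of the weighted signal/background counts classified as signal, e.g. the AMS defined earlier:

```python
def best_threshold(scores, labels, weights, metric):
    """Scan every cut point of the score-sorted sample; examples with score
    at or above the returned threshold are classified as signal (+1).
    metric(s, b) maps weighted signal/background counts to a value to maximize."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s = b = 0.0
    best_value, best_theta = float("-inf"), None
    for i in order:
        if labels[i] == 1:
            s += weights[i]
        else:
            b += weights[i]
        value = metric(s, b)
        if value > best_value:
            # store the score of the last example included on the positive side;
            # any threshold between this score and the next gives the same split
            best_value, best_theta = value, scores[i]
    return best_value, best_theta
```

For example, with scores [0.9, 0.8, 0.2, 0.1], labels [1, 1, -1, -1], unit weights, and the toy metric s − b, the scan selects the cut after the two positives.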
Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, volume 45 of JMLR W&C Proc., pages 301–316, 2015. [Best Paper Award]
Wojciech Kotłowski. Consistent optimization of AMS by logistic loss minimization. In NIPS HEPML Workshop, volume 42 of JMLR W&C Proc., pages 99–108, 2015. 6 / 36
Generalized performance metrics for binary classification
Given a binary classifier h: X → {−1, 1}, define:
  Ψ(h) = Ψ( FP(h), FN(h) ),
where:
  FP(h) = Pr(h(x) = 1, y = −1),   FN(h) = Pr(h(x) = −1, y = 1).
Confusion matrix (true y in rows, predicted ŷ = h(x) in columns):
            ŷ = −1   ŷ = +1   total
  y = −1      TN       FP     1 − P
  y = +1      FN       TP       P
Goal: maximize Ψ(h). We assume Ψ is non-increasing in FP and FN. 7 / 36
Linear-fractional performance metric
Definition:
  Ψ(FP, FN) = (a₀ + a₁ FP + a₂ FN) / (b₀ + b₁ FP + b₂ FN).
Examples:
  Accuracy: Acc = 1 − FP − FN
  F_β-measure: F_β = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
  Jaccard similarity: J = (P − FN) / (P + FP)
  AM measure: AM = 1 − FN/(2P) − FP/(2(1 − P))
  Weighted accuracy: WA = 1 − w₋ FP − w₊ FN
8 / 36
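The example metrics above are simple functions of FP, FN, and the positive-class prior P = Pr(y = 1); a sketch (function names are illustrative):

```python
def f_beta(FP, FN, P, beta=1.0):
    # F_beta written in terms of FP, FN and the positive-class prior P
    b2 = beta ** 2
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(FP, FN, P):
    # intersection over union of predicted and true positives
    return (P - FN) / (P + FP)

def am(FP, FN, P):
    # mean of the two per-class accuracies (balanced accuracy)
    return 1.0 - FN / (2 * P) - FP / (2 * (1 - P))
```

A quick sanity check: a perfect classifier (FP = FN = 0) attains the value 1 under all three metrics, for any prior P.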
Convex performance metrics
Definition: Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS₂ score:
  AMS₂(TP, FP) = √( 2( (TP + FP) log(1 + TP/FP) − TP ) ).
9 / 36
Example: F₁-measure
[Figure: surface plot of the F₁-measure as a function of FP and FN.] 10 / 36
Example: AMS₂ score
[Figure: surface plot of the AMS₂ score as a function of FP and FN.] 11 / 36
A simple approach to optimization of Ψ(h)
1. On training data: learn a real-valued f(x) (using a standard classification tool).
2. On validation data: learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ}), where
   h_{f,θ}(x) = −1 if f(x) < θ, and h_{f,θ}(x) = +1 if f(x) ≥ θ.
12 / 36
Our results
Algorithm: training set → f(x); validation set → θ; output h_{f,θ}(x).
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.
Our results (informally)
Assumptions: the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss), and Ψ is linear-fractional or jointly convex.
Claim: if f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ. 13 / 36
Ψ-regret and ℓ-regret
Ψ-regret of a classifier h: X → {−1, 1}:
  Reg_Ψ(h) = Ψ(h*) − Ψ(h), where h* = argmax_h Ψ(h).
Measures the suboptimality of h.
Surrogate loss ℓ(y, f(x)) of a real-valued function f: X → R, used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f: Risk_ℓ(f) = E_{(x,y)}[ℓ(y, f(x))].
ℓ-regret of f: Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*), where f* = argmin_f Risk_ℓ(f).
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f. 14 / 36
Examples of surrogate losses
Logistic loss: ℓ(y, ŷ) = log(1 + e^{−yŷ}).
Risk minimizer: f*(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1 | x):
an invertible function of the conditional probability η(x).
Hinge loss: ℓ(y, ŷ) = (1 − yŷ)₊.
Its risk minimizer f*(x) = sgn(η(x) − 1/2) is non-invertible. 15 / 36
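The contrast between the two risk minimizers can be made concrete; a minimal sketch (function names are illustrative):

```python
import math

def logistic_minimizer(eta):
    # f*(x) = log(eta / (1 - eta)): invertible, eta is recoverable from f*
    return math.log(eta / (1.0 - eta))

def sigmoid(f):
    # inverse of the logistic link: recovers eta from f*
    return 1.0 / (1.0 + math.exp(-f))

def hinge_minimizer(eta):
    # f*(x) = sgn(eta - 1/2): every eta > 1/2 collapses to +1,
    # so eta cannot be recovered from f*
    return 1.0 if eta > 0.5 else -1.0
```

The round trip sigmoid(logistic_minimizer(η)) returns η, whereas hinge_minimizer maps, e.g., η = 0.6 and η = 0.9 to the same value; this loss of probability information is exactly what the theory below penalizes.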
Examples of surrogate losses
[Figure: 0/1 loss, squared-error loss, logistic loss, hinge loss, and exponential loss plotted as functions of the margin yf(x).] 16 / 36
Examples of surrogate losses
  loss           f*(η) = ψ(η)            η(f*) = ψ⁻¹(f*)
  squared error  2η − 1                  (1 + f)/2
  logistic       log(η/(1 − η))          1/(1 + e^{−f})
  exponential    (1/2) log(η/(1 − η))    1/(1 + e^{−2f})
  hinge          sgn(η − 1/2)            doesn't exist
17 / 36
Proper composite losses
ℓ(y, f) is proper composite if there exists a strictly increasing link function ψ such that:
  f*(x) = ψ(η(x)), where η(x) = Pr(y = 1 | x).
Minimizing a proper composite loss thus implicitly performs probability estimation. 18 / 36
Strongly proper composite losses [Agarwal, 2014]
ℓ(y, f) is λ-strongly proper composite if it is proper composite and, for any f, x, and distribution of y | x:
  E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f*(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².
A technical condition. 19 / 36
Strongly proper composite losses: examples
  loss           f*(η) = ψ(η)            η(f*) = ψ⁻¹(f*)    λ
  squared error  2η − 1                  (1 + f)/2           8
  logistic       log(η/(1 − η))          1/(1 + e^{−f})      4
  exponential    (1/2) log(η/(1 − η))    1/(1 + e^{−2f})     4
20 / 36
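The strong-properness inequality can be sanity-checked numerically for the logistic loss with λ = 4 from the table; a sketch over an arbitrary grid of η and f values (the grid and helper names are illustrative):

```python
import math

def logistic_loss(y, f):
    return math.log(1.0 + math.exp(-y * f))

def expected_regret(eta, f):
    # E_{y ~ eta}[ l(y, f) - l(y, f*) ] with f* = log(eta / (1 - eta))
    fstar = math.log(eta / (1.0 - eta))
    return (eta * (logistic_loss(1, f) - logistic_loss(1, fstar))
            + (1 - eta) * (logistic_loss(-1, f) - logistic_loss(-1, fstar)))

lam = 4.0  # strong-properness constant of the logistic loss (table above)
holds = all(
    expected_regret(eta, f)
    >= (lam / 2.0) * (eta - 1.0 / (1.0 + math.exp(-f))) ** 2 - 1e-12
    for eta in [0.1, 0.3, 0.5, 0.7, 0.9]
    for f in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
```

Here ψ⁻¹(f) = 1/(1 + e^{−f}) is the inverse logistic link from the table; the small slack 1e-12 only guards against floating-point rounding at the equality case η = ψ⁻¹(f).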
Main result
Theorem (linear-fractional metrics). If:
  Ψ(FP, FN) is linear-fractional and non-increasing in FP and FN,
  ℓ is λ-strongly proper composite,
then there exists a threshold θ* such that, for any real-valued function f,
  Reg_Ψ(h_{f,θ*}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ).
  metric               C_Ψ
  F_β-measure          (1 + β²)/(β² P)
  Jaccard similarity   (J* + 1)/P
  AM measure           1/(2P(1 − P))
A similar theorem holds for convex performance metrics (such as AMS₂). 21 / 36
Explanation of the theorem
The maximizer of Ψ is h*(x) = sgn(η(x) − η*) for some η*.
The minimizer of ℓ is f*(x) = ψ(η(x)) for an invertible ψ.
Thresholding f*(x) at θ* = ψ(η*) is equivalent to thresholding η(x) at η*.
[Figure: a number line of f*(x) values (−3.5 to 3) with the corresponding η(x) = 1/(1 + e^{−f*(x)}) values (0.03 to 0.95); the threshold θ* = ψ(η*) splits both scales at the same point.]
The gradient of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f*. 22 / 36
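The thresholding equivalence can be checked directly for the logistic link; η* = 0.3 below is an arbitrary illustrative value, and the η grid mirrors the figure:

```python
import math

def psi(eta):
    # logistic link: f* = psi(eta), strictly increasing in eta
    return math.log(eta / (1.0 - eta))

eta_star = 0.3          # hypothetical optimal threshold on eta
theta = psi(eta_star)   # corresponding threshold on f* = psi(eta)

# because psi is strictly increasing, thresholding f* at theta and
# thresholding eta at eta_star classify every point identically
agree = all(
    (psi(eta) >= theta) == (eta >= eta_star)
    for eta in [0.03, 0.08, 0.12, 0.27, 0.38, 0.62, 0.82, 0.88, 0.95]
)
```

This is the whole mechanism behind the theorem: any strictly increasing link transfers a threshold on η to a threshold on f* without changing a single classification.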
The optimal threshold θ*
Ψ = classification accuracy ⇒ θ* = 0 (η* = 1/2).
Ψ = weighted accuracy ⇒ η* = w₋/(w₊ + w₋).
More complex Ψ ⇒ θ* unknown (it depends on Ψ(h*)) ...
⇒ estimate θ from validation data. 23 / 36
Tuning the threshold
Corollary. Given a real-valued function f and a validation sample of size m, let:
  θ̂ = argmax_θ Ψ̂(h_{f,θ})   (Ψ̂ = empirical estimate of Ψ).
Then, under the same assumptions and notation:
  Reg_Ψ(h_{f,θ̂}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ) + O(1/√m).
Learning a standard binary classifier and tuning the threshold afterwards is thus able to recover the maximizer of Ψ in the limit. 24 / 36
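A minimal sketch of the estimator θ̂ from the corollary: scan candidate thresholds over the validation scores and maximize the plug-in estimate Ψ̂. The metric signature follows the parametrization of the earlier slides; all names are illustrative:

```python
def tune_threshold(scores, labels, metric):
    """Return (theta_hat, value): the threshold on validation scores that
    maximizes the empirical metric(FP, FN, P), with rates estimated by counting."""
    n = len(labels)
    P = sum(1 for y in labels if y == 1) / n
    best_theta, best_value = None, float("-inf")
    for theta in sorted(set(scores)):
        FP = sum(1 for s, y in zip(scores, labels) if s >= theta and y == -1) / n
        FN = sum(1 for s, y in zip(scores, labels) if s < theta and y == 1) / n
        value = metric(FP, FN, P)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value
```

With a perfectly separating score (e.g. scores [0.9, 0.8, 0.2, 0.1] with labels [1, 1, −1, −1]) and accuracy 1 − FP − FN as the metric, the scan finds a threshold achieving the value 1.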
Multilabel classification
A vector of m labels y = (y₁, ..., y_m) for each x.
Multilabel classifier h(x) = (h₁(x), ..., h_m(x)).
False positive/negative rates for each label:
  FP_i(h_i) = Pr(h_i = 1, y_i = −1),   FN_i(h_i) = Pr(h_i = −1, y_i = 1).
How to use a binary classification metric Ψ in the multilabel setting?
We extend our bounds to cover micro- and macro-averaging. 25 / 36
Micro- and macro-averaging
Macro-averaging: average outside Ψ:
  Ψ_macro(h) = Σ_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).
Our bound suggests that a separate threshold needs to be tuned for each label.
Micro-averaging: average inside Ψ:
  Ψ_micro(h) = Ψ( (1/m) Σ_{i=1}^m FP_i(h_i), (1/m) Σ_{i=1}^m FN_i(h_i) ).
Our bound suggests that all labels share a single threshold. 26 / 36
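The two schemes differ only in where the averaging happens; a sketch (the macro average is normalized by m here so the two are directly comparable):

```python
def macro_average(metric, rates):
    # average the metric over per-label (FP_i, FN_i) pairs
    return sum(metric(fp, fn) for fp, fn in rates) / len(rates)

def micro_average(metric, rates):
    # apply the metric to the label-averaged FP and FN rates
    m = len(rates)
    fp = sum(fp for fp, _ in rates) / m
    fn = sum(fn for _, fn in rates) / m
    return metric(fp, fn)
```

For a metric linear in (FP, FN), such as accuracy 1 − FP − FN, the two averages coincide; for a nonlinear metric like the F-measure they generally differ, which is why the two settings call for different thresholding strategies.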
Experiments 27 / 36
Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Procedure: minimize ℓ (logistic or hinge) on training data, then tune the threshold θ̂ on validation data by optimizing the metric. 28 / 36
Synthetic experiment I
X = {1, 2, ..., 25} with Pr(x) uniform. For each x ∈ X, η(x) ~ Unif[0, 1].
[Figure, left (logistic loss surrogate, converges as expected): logistic regret, F-measure regret, and AM regret vs. # of training examples. Right (hinge loss surrogate, does not converge): hinge regret, F-measure regret, and AM regret vs. # of training examples.] 29 / 36
Synthetic experiment II
X = R², x ~ N(0, I). Logistic model: η(x) = (1 + exp(−a₀ − aᵀx))⁻¹.
[Figure, left (convergence of the F-measure): logistic regret (LR), hinge regret (SVM), and F-measure regret (LR, SVM) vs. # of training examples. Right (convergence of the AM measure): the same with AM regret.] 30 / 36
Benchmark data experiment: a bit of a surprise
  dataset          #examples   #features
  covtype.binary   581,012     54
  gisette          7,000       5,000
[Figure: F-measure and AM measure vs. # of training examples for LR and SVM on covtype.binary and gisette.]
Logistic loss behaves as expected, but hinge loss also performs surprisingly well. 31 / 36
Multilabel classification
  data set    # labels   # training examples   # test examples   # features
  scene       6          1211                  1169              294
  yeast       14         1500                  917               103
  mediamill   101        30993                 12914             120
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation: algorithms tuned for micro-averaging are also evaluated with macro-averaged metrics, and vice versa. 32 / 36
Multilabel classification: F-measure
[Figure: macro- and micro-averaged F-measure vs. # of training examples on scene, yeast, and mediamill, for LR and SVM tuned for macro- and micro-averaging.] 33 / 36
Multilabel classification: AM measure
[Figure: macro- and micro-averaged AM measure vs. # of training examples on scene, yeast, and mediamill, for LR and SVM tuned for macro- and micro-averaging.] 34 / 36
Open problem
  Ψ(h*) − Ψ(h) ≤ const · √( Risk_ℓ(f) − Risk_ℓ(f*) ),   where f* is the minimizer over all functions.
We often optimize within some class F (e.g., linear functions). If f* ∉ F, the right-hand side does not converge to 0.
This can be beneficial for non-proper losses (e.g., hinge loss).
Most often f* ∉ F, and a theory for this case is needed. 35 / 36
Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex metrics optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer. Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if its risk minimizer is outside the family of classification functions? 36 / 36