Surrogate regret bounds for generalized classification performance metrics

Surrogate regret bounds for generalized classification performance metrics
Wojciech Kotłowski, Krzysztof Dembczyński
Poznań University of Technology
PL-SIGML, Częstochowa, 14.04.2016

Motivation

Kaggle Higgs Boson Machine Learning Challenge

Data set and rules
Classes: signal (h → τ⁺τ⁻) and background.
# events: 250 000, # features: 30, % signal weight: 0.17.
Evaluation: Approximate Median Significance (AMS):
AMS = sqrt( 2 [ (s + b + 10) log(1 + s/(b + 10)) − s ] ),
where s, b are the total weights of signal/background events classified as signal.

How to optimize AMS?
Research problem: how do we optimize a global function of true/false positives/negatives that does not decompose into individual losses over the observations?
Most popular approach: sort the classifier's scores and pick the threshold that maximizes AMS (examples scoring below the threshold are classified as negative, above as positive). AMS is not used during training, only for tuning the threshold.
Is this approach theoretically justified?
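The sort-and-threshold heuristic above can be sketched in a few lines. This is my own illustrative code, not the challenge's reference implementation; function names, the brute-force scan, and the toy interface are assumptions:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance with regularisation term b_reg = 10."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def best_threshold(scores, labels, weights):
    """Scan every score threshold and return (best AMS, threshold).

    scores:  classifier outputs f(x)
    labels:  +1 (signal) / -1 (background)
    weights: event weights
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    s = b = 0.0
    best = (float("-inf"), None)
    # classify the top-k scored events as signal, for every k
    for i in order:
        if labels[i] == 1:
            s += weights[i]
        else:
            b += weights[i]
        best = max(best, (ams(s, b), scores[i]))
    return best
```

Note that AMS enters only through this post-hoc scan; training the scorer itself uses an ordinary surrogate loss.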

Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, volume 45 of JMLR W&CP, pages 301–316, 2015. [Best Paper Award]
Wojciech Kotłowski. Consistent optimization of AMS by logistic loss minimization. In NIPS HEPML Workshop, volume 42 of JMLR W&CP, pages 99–108, 2015.

Generalized performance metrics for binary classification
Given a binary classifier h: X → {−1, 1}, define:
Ψ(h) = Ψ(FP(h), FN(h)),
where:
FP(h) = Pr(h(x) = 1, y = −1), FN(h) = Pr(h(x) = −1, y = 1).

true y \ predicted ŷ = h(x) | −1 | +1 | total
−1 | TN | FP | 1 − P
+1 | FN | TP | P

Goal: maximize Ψ(h). We assume Ψ is non-increasing in FP and FN.

Linear-fractional performance metric
Definition:
Ψ(FP, FN) = (a₀ + a₁ FP + a₂ FN) / (b₀ + b₁ FP + b₂ FN)
Examples:
Accuracy: Acc = 1 − FP − FN
F_β-measure: F_β = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity: J = (P − FN) / (P + FP)
AM measure: AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy: WA = 1 − w₋ FP − w₊ FN
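For concreteness, a few of these linear-fractional metrics written as plain functions of (FP, FN) and the positive-class prior P. A hedged sketch; the function names are mine:

```python
def f_beta(FP, FN, P, beta=1.0):
    # F_beta as a linear-fractional function of (FP, FN); P = Pr(y = +1)
    b2 = beta * beta
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(FP, FN, P):
    # Jaccard similarity: TP / (TP + FP + FN), with TP = P - FN
    return (P - FN) / (P + FP)

def am_measure(FP, FN, P):
    # arithmetic mean of the true positive and true negative rates
    return 1.0 - FN / (2 * P) - FP / (2 * (1 - P))
```

A perfect classifier (FP = FN = 0) attains the value 1 under all three, consistent with each metric being non-increasing in FP and FN.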

Convex performance metrics
Definition: Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS² score
AMS₂(TP, FP) = 2 [ (TP + FP) log(1 + TP/FP) − TP ].

Example: F₁-measure (surface plot of F₁ as a function of FP and FN).

Example: AMS₂ score (surface plot of AMS₂ as a function of FP and FN).

A simple approach to optimization of Ψ(h)
1. On training data, learn a real-valued f(x) using a standard classification tool.
2. On validation data, learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ}), where h_{f,θ}(x) = +1 if f(x) > θ and −1 otherwise.

Our results
Algorithm:
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.
Our results (informally):
Assumptions: the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss), and Ψ is linear-fractional or jointly convex.
Claim: if f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.

Ψ-regret and ℓ-regret
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.
Ψ-regret of a classifier h: X → {−1, 1} (measures suboptimality):
Reg_Ψ(h) = Ψ(h*) − Ψ(h), where h* = argmax_h Ψ(h).
Surrogate loss ℓ(y, f(x)) of a real-valued function f: X → R, used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f: Risk_ℓ(f) = E_{(x,y)}[ℓ(y, f(x))].
ℓ-regret of f: Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*), where f* = argmin_f Risk_ℓ(f).

Examples of surrogate losses
Logistic loss: ℓ(y, ŷ) = log(1 + e^(−y ŷ)).
Risk minimizer: f*(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1 | x) — an invertible function of the conditional probability η(x).
Hinge loss: ℓ(y, ŷ) = (1 − y ŷ)₊.
Its risk minimizer is non-invertible: f*(x) = sgn(η(x) − 1/2).
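A minimal sketch of the link functions and their inverses for the logistic and exponential losses; function names are illustrative, and the expressions follow the risk-minimizer formulas above:

```python
import math

def logistic_link(eta):
    # psi(eta) = log(eta / (1 - eta)); the logistic-loss risk minimizer at eta
    return math.log(eta / (1.0 - eta))

def logistic_link_inv(f):
    # the inverse link is the sigmoid: eta = 1 / (1 + exp(-f))
    return 1.0 / (1.0 + math.exp(-f))

def exp_link(eta):
    # exponential loss: psi(eta) = 0.5 * log(eta / (1 - eta))
    return 0.5 * math.log(eta / (1.0 - eta))

def exp_link_inv(f):
    # inverse: eta = 1 / (1 + exp(-2f))
    return 1.0 / (1.0 + math.exp(-2.0 * f))
```

Because these links are strictly increasing and invertible, a trained f can be read back as a probability estimate; the hinge loss has no such inverse, which is exactly why it behaves differently in the analysis.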

(Plot: 0/1 loss, squared-error loss, logistic loss, hinge loss, and exponential loss as functions of the margin y f(x).)

Examples of surrogate losses

loss | f*(η) = ψ(η) | η(f*) = ψ⁻¹(f*)
squared error | 2η − 1 | (1 + f)/2
logistic | log(η/(1 − η)) | 1/(1 + e^(−f))
exponential | (1/2) log(η/(1 − η)) | 1/(1 + e^(−2f))
hinge | sgn(η − 1/2) | doesn't exist

Proper composite losses
ℓ(y, f) is proper composite if there exists a strictly increasing link function ψ such that:
f*(x) = ψ(η(x)), where η(x) = Pr(y = 1 | x).
Minimizing a proper composite loss thus implies probability estimation.

Strongly proper composite losses [Agarwal, 2014]
ℓ(y, f) is λ-strongly proper composite if it is proper composite and, for any f, x, and conditional distribution of y given x:
E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f*(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².
A technical condition.

Strongly proper composite losses: examples

loss | f*(η) = ψ(η) | η(f*) = ψ⁻¹(f*) | λ
squared error | 2η − 1 | (1 + f)/2 | 8
logistic | log(η/(1 − η)) | 1/(1 + e^(−f)) | 4
exponential | (1/2) log(η/(1 − η)) | 1/(1 + e^(−2f)) | 4
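The λ = 4 constant for logistic loss can be sanity-checked numerically: the pointwise logistic regret equals KL(η ‖ ψ⁻¹(f)), which Pinsker's inequality lower-bounds by 2(η − ψ⁻¹(f))², i.e. (λ/2) with λ = 4. A small grid check of my own, a sketch rather than a proof:

```python
import math

def cond_logistic_risk(eta, f):
    # E_{y ~ eta}[ log(1 + exp(-y f)) ] for the conditional distribution Pr(y=1)=eta
    return eta * math.log1p(math.exp(-f)) + (1 - eta) * math.log1p(math.exp(f))

def pointwise_regret(eta, f):
    # regret relative to the minimizer f* = log(eta / (1 - eta))
    f_star = math.log(eta / (1 - eta))
    return cond_logistic_risk(eta, f) - cond_logistic_risk(eta, f_star)

# verify the strong-properness inequality with lambda = 4 on a coarse grid
LAM = 4.0
for eta in [0.1 * k for k in range(1, 10)]:
    for f in [-3.0 + 0.5 * j for j in range(13)]:
        gap = eta - 1.0 / (1.0 + math.exp(-f))   # eta - psi^{-1}(f)
        assert pointwise_regret(eta, f) >= LAM / 2 * gap ** 2 - 1e-9
```

The regret is zero exactly when f equals the link of η, and grows at least quadratically in the probability-estimation error, which is what the theorem's √Reg_ℓ(f) bound exploits.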

Main result
Theorem (linear-fractional metrics). If Ψ(FP, FN) is linear-fractional and non-increasing in FP and FN, and ℓ is λ-strongly proper composite, then there exists a threshold θ* such that for any real-valued function f:
Reg_Ψ(h_{f,θ*}) ≤ C_Ψ sqrt( (2/λ) Reg_ℓ(f) ).

metric | C_Ψ
F_β-measure | (1 + β²)/(β² P)
Jaccard similarity | (J* + 1)/P
AM measure | 1/(2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS₂).

Explanation of the theorem
The maximizer of Ψ is h*(x) = sgn(η(x) − η*) for some η*.
The minimizer of ℓ is f*(x) = ψ(η(x)) for an invertible ψ.
Thresholding f*(x) at θ* = ψ(η*) is equivalent to thresholding η(x) at η*.
(Number-line illustration: scores f*(x) from −3.5 to 3 map through η(x) = 1/(1 + e^(−f*(x))) to probabilities from 0.03 to 0.95; the threshold θ* = ψ(η*) on f* corresponds to η* on η.)
The gradient of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f*.

The optimal threshold θ*
Ψ = classification accuracy ⇒ θ* = 0 (η* = 1/2).
Ψ = weighted accuracy ⇒ η* = w₋ / (w₊ + w₋).
More complex Ψ ⇒ θ* unknown (it depends on Ψ(h*)) ⇒ estimate θ* from validation data.

Tuning the threshold
Corollary. Given a real-valued function f and a validation sample of size m, let:
θ̂ = argmax_θ Ψ̂(h_{f,θ}) (Ψ̂ = empirical estimate of Ψ).
Then, under the same assumptions and notation:
Reg_Ψ(h_{f,θ̂}) ≤ C_Ψ sqrt( (2/λ) Reg_ℓ(f) ) + O(1/√m).
Learning a standard binary classifier and tuning the threshold afterwards thus recovers the maximizer of Ψ in the limit.
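The corollary's threshold tuning is a one-dimensional search that can be done exactly by scanning the validation scores. A hedged sketch; the interface psi(FP, FN, P) and all names are my own assumptions:

```python
def tune_threshold(scores, labels, psi):
    """Pick the threshold on f maximising an empirical estimate of Psi.

    scores: validation scores f(x); labels: +1 / -1;
    psi(FP, FN, P) -> metric value, with FP, FN, P empirical rates.
    """
    n = len(scores)
    P = sum(1 for y in labels if y == 1) / n
    best_theta, best_val = None, float("-inf")
    for theta in sorted(set(scores)):
        # h_{f,theta}(x) = +1 iff f(x) > theta
        FP = sum(1 for s, y in zip(scores, labels) if s > theta and y == -1) / n
        FN = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 1) / n
        v = psi(FP, FN, P)
        if v > best_val:
            best_theta, best_val = theta, v
    return best_theta, best_val
```

Only the n distinct scores need to be tried, since the empirical (FP, FN) is piecewise constant between them; the O(1/√m) term in the corollary accounts for the estimation error of Ψ̂ on the validation sample.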

Multilabel classification
A vector of m labels y = (y₁, ..., y_m) for each x; a multilabel classifier h(x) = (h₁(x), ..., h_m(x)).
False positive/negative rates for each label:
FP_i(h_i) = Pr(h_i = 1, y_i = −1), FN_i(h_i) = Pr(h_i = −1, y_i = 1).
How can a binary classification metric Ψ be used in the multilabel setting? We extend our bounds to cover micro- and macro-averaging.

Micro- and macro-averaging
Macro-averaging (average outside Ψ):
Ψ_macro(h) = (1/m) Σ_{i=1}^m Ψ(FP_i(h_i), FN_i(h_i)).
Our bound suggests that a separate threshold needs to be tuned for each label.
Micro-averaging (average inside Ψ):
Ψ_micro(h) = Ψ( (1/m) Σ_{i=1}^m FP_i(h_i), (1/m) Σ_{i=1}^m FN_i(h_i) ).
Our bound suggests that all labels share a single threshold.
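The two averaging schemes differ only in where the average is taken. A small sketch, assuming Ψ is supplied as a function of (FP, FN); names are mine:

```python
def macro_average(psi, FP, FN):
    # average of Psi applied per label: (1/m) * sum_i Psi(FP_i, FN_i)
    m = len(FP)
    return sum(psi(FP[i], FN[i]) for i in range(m)) / m

def micro_average(psi, FP, FN):
    # Psi applied once to the label-averaged error rates
    m = len(FP)
    return psi(sum(FP) / m, sum(FN) / m)
```

For a linear Ψ the two coincide; for a nonlinear Ψ (e.g. an F-measure-like ratio) they generally differ, which is why the bound prescribes per-label thresholds in the macro case but a single shared threshold in the micro case.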

Experiments

Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Procedure: minimize ℓ (logistic or hinge) on training data, then tune the threshold θ̂ on validation data by optimizing the metric.

Synthetic experiment I
X = {1, 2, ..., 25} with Pr(x) uniform; for each x ∈ X, η(x) ~ Unif[0, 1].
(Plots: regret of the F-measure, the AM measure, and the surrogate loss vs. number of training examples, up to 10,000.)
With the logistic-loss surrogate, the F-measure and AM regrets converge as expected; with the hinge-loss surrogate, they do not converge.

Synthetic experiment II
X = R², x ~ N(0, I); logistic model η(x) = (1 + exp(−a₀ − aᵀx))⁻¹.
(Plots: convergence of the F-measure and AM-measure regrets and the surrogate-loss regrets for LR and SVM, vs. number of training examples up to 3,000.)

Benchmark data experiment (a bit of a surprise)

dataset | #examples | #features
covtype.binary | 581,012 | 54
gisette | 7,000 | 5,000

(Plots: F-measure and AM on covtype.binary and gisette vs. number of training examples, for LR and SVM.)
Logistic loss behaves as expected, but hinge loss also performs surprisingly well.

Multilabel classification

data set | #labels | #training examples | #test examples | #features
scene | 6 | 1,211 | 1,169 | 294
yeast | 14 | 1,500 | 917 | 103
mediamill | 101 | 30,993 | 12,914 | 120

Surrogates: logistic loss (LR) and hinge loss (SVM). Performance metrics: F-measure and AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation: algorithms tuned for micro-averaging are evaluated with macro-averaged metrics, and vice versa.

Multilabel classification: F-measure
(Plots: macro- and micro-averaged F-measure on scene, yeast, and mediamill vs. number of training examples, for LR and SVM tuned with macro- or micro-averaged thresholds.)

Multilabel classification: AM measure
(Plots: macro- and micro-averaged AM on scene, yeast, and mediamill vs. number of training examples, for LR and SVM tuned with macro- or micro-averaged thresholds.)

Open problem
Ψ(h*) − Ψ(h) ≤ const · sqrt( Risk_ℓ(f) − Risk_ℓ(f*) ), where f* is the minimizer over all functions.
We often optimize within some class F (e.g., linear functions). If f* ∉ F, the right-hand side does not converge to 0.
This can be beneficial for non-proper losses (e.g., hinge loss).
Most often f* ∉ F, and a theory for this case is needed.

Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex metrics optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer. Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if the risk minimizer is outside the family of functions considered?