Surrogate regret bounds for generalized classification performance metrics
1 Surrogate regret bounds for generalized classification performance metrics
Wojciech Kotłowski, Krzysztof Dembczyński
Poznań University of Technology
PL-SIGML, Częstochowa
2 Motivation
3 Kaggle Higgs Boson Machine Learning Challenge
4 Data set and rules
Classes: signal (h → τ⁺τ⁻) and background.
(# events, # features, % signal weight)
Evaluation: Approximate Median Significance (AMS):
AMS = √( 2( (s + b + 10) log(1 + s/(b + 10)) − s ) ),
where s, b are the total weights of signal/background events classified as signal.
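The AMS formula above translates directly into code. A minimal sketch (the regularization default b_reg = 10 follows the slide; the function name is mine):

```python
import math

def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance, as defined on the slide.

    s, b: total weights of signal/background events classified as signal;
    b_reg = 10 is the regularization constant from the challenge.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Admitting more background at the same signal weight lowers the score:
high_purity = ams(100.0, 200.0)
low_purity = ams(100.0, 400.0)
```

With no selected signal the score is zero, and for fixed s the score decreases in b, matching the intuition that AMS rewards pure signal regions.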
6 How to optimize AMS?
Research Problem: How to optimize a global function of true/false positives/negatives, not decomposable into individual losses over the observations?
Most popular approach: sort the classifier's scores and tune a threshold to maximize AMS: examples scoring below the threshold are classified as negative, those above as positive.
AMS is not used while training, only for tuning the threshold.
Is this approach theoretically justified?
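The sort-and-threshold heuristic is easy to sketch. The code below is illustrative, not any competitor's actual code: it scans every cut point of the sorted scores and keeps the one maximizing AMS (labels +1/−1 denote signal/background; `ams` repeats the formula from the data-set slide):

```python
import math

def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def best_threshold(scores, labels, weights):
    """Classify events with score > threshold as signal; return the
    (AMS, threshold) pair maximizing AMS over all cut points."""
    events = sorted(zip(scores, labels, weights))   # ascending by score
    s = sum(w for _, y, w in events if y == 1)      # with threshold -inf,
    b = sum(w for _, y, w in events if y == -1)     # everything is "signal"
    best_ams, best_theta = ams(s, b), float("-inf")
    for score, y, w in events:
        # Raise the threshold to this score: the event flips to "background".
        if y == 1:
            s -= w
        else:
            b -= w
        if ams(s, b) > best_ams:
            best_ams, best_theta = ams(s, b), score
    return best_ams, best_theta

# Toy example: signal events score high, so the best cut sits between classes.
a, theta = best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1], [1.0] * 4)
```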
9 Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, volume 45 of JMLR W&C Proc., 2015. [Best Paper Award]
Wojciech Kotłowski. Consistent optimization of AMS by logistic loss minimization. In NIPS HEPML Workshop, volume 42 of JMLR W&C Proc.
10 Generalized performance metrics for binary classification
Given a binary classifier h: X → {−1, 1}, define:
Ψ(h) = Ψ( FP(h), FN(h) ),
where:
FP(h) = Pr(h(x) = 1, y = −1), FN(h) = Pr(h(x) = −1, y = 1).

predicted ŷ = h(x):   −1    +1    total
true y = −1:          TN    FP    1 − P
true y = +1:          FN    TP    P

Goal: maximize Ψ(h). We assume Ψ is non-increasing in FP and FN.
11 Linear-fractional performance metric
Definition:
Ψ(FP, FN) = (a₀ + a₁FP + a₂FN) / (b₀ + b₁FP + b₂FN)

Examples:
Accuracy: Acc = 1 − FP − FN
F_β-measure: F_β = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity: J = (P − FN) / (P + FP)
AM measure: AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy: WA = 1 − w₋FP − w₊FN
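These formulas translate directly into code. A small sketch in the slide's notation (population rates FP, FN and class prior P = Pr(y = 1); the function names are mine):

```python
def f_beta(FP: float, FN: float, P: float, beta: float = 1.0) -> float:
    """F_beta in terms of FP, FN and the prior P."""
    b2 = beta * beta
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(FP: float, FN: float, P: float) -> float:
    return (P - FN) / (P + FP)

def am(FP: float, FN: float, P: float) -> float:
    return 1 - FN / (2 * P) - FP / (2 * (1 - P))

# A perfect classifier (FP = FN = 0) scores 1 under all three metrics.
perfect = (f_beta(0, 0, 0.3), jaccard(0, 0, 0.3), am(0, 0, 0.3))
```

Sanity check: with P = 0.3 and FP = FN = 0.1 (so TP = 0.2), precision = recall = 2/3 and hence F₁ = 2/3, which the rate-based formula reproduces.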
13 Convex performance metrics
Definition: Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS₂ score:
AMS₂(TP, FP) = √( 2( (TP + FP) log(1 + TP/FP) − TP ) ).
15 Example: F₁-measure. (Figure: surface of F₁ as a function of FP and FN.)
16 Example: AMS₂ score. (Figure: surface of AMS₂ as a function of FP and FN.)
17 A simple approach to optimization of Ψ(h)
1. On the training data, learn a real-valued f(x) (using a standard classification tool).
2. On the validation data, learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ}), where h_{f,θ}(x) = −1 if f(x) ≤ θ and h_{f,θ}(x) = +1 if f(x) > θ.
21 Our results
Algorithm:
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.
Our results (informally):
Assumptions: the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss), and Ψ is linear-fractional or jointly convex.
Claim: if f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.
23 Ψ-regret and ℓ-regret
Ψ-regret of a classifier h: X → {−1, 1} (measures suboptimality):
Reg_Ψ(h) = Ψ(h*) − Ψ(h), where h* = argmax_h Ψ(h).
Surrogate loss ℓ(y, f(x)) of a real-valued function f: X → R, used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f: Risk_ℓ(f) = E_{(x,y)}[ℓ(y, f(x))].
ℓ-regret of f: Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*), where f* = argmin_f Risk_ℓ(f).
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.
28 Examples of surrogate losses
Logistic loss: ℓ(y, ŷ) = log(1 + e^{−yŷ}).
Its risk minimizer f*(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1 | x), is an invertible function of the conditional probability η(x).
Hinge loss: ℓ(y, ŷ) = (1 − yŷ)₊.
Its risk minimizer f*(x) = sgn(η(x) − 1/2) is non-invertible.
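The invertibility of the logistic link can be checked numerically. A small illustrative sketch (not from the paper): for fixed η, the conditional logistic risk is minimized at f = ψ(η) = log(η/(1 − η)), and the sigmoid inverts the link:

```python
import math

def logistic_loss(y: int, f: float) -> float:
    return math.log1p(math.exp(-y * f))

def cond_risk(eta: float, f: float) -> float:
    """Conditional risk E_{y|x}[l(y, f)] when Pr(y = 1 | x) = eta."""
    return eta * logistic_loss(1, f) + (1 - eta) * logistic_loss(-1, f)

eta = 0.8
f_star = math.log(eta / (1 - eta))        # psi(eta): the link
eta_back = 1 / (1 + math.exp(-f_star))    # the sigmoid recovers eta
# On a grid around f_star, no point beats f_star: the risk is minimized there.
grid = [f_star + 0.1 * k for k in range(-20, 21)]
best_f = min(grid, key=lambda f: cond_risk(eta, f))
```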
30 Examples of surrogate losses. (Figure: 0/1 loss, squared error loss, logistic loss, hinge loss, and exponential loss as functions of yf(x).)
31 Examples of surrogate losses

loss          | f*(η) = ψ(η)          | η(f*) = ψ⁻¹(f*)
squared error | 2η − 1                | (1 + f)/2
logistic      | log(η/(1 − η))        | e^f / (1 + e^f)
exponential   | (1/2) log(η/(1 − η))  | 1 / (1 + e^{−2f})
hinge         | sgn(η − 1/2)          | doesn't exist
32 Proper composite losses
ℓ(y, f) is proper composite if there exists a strictly increasing link function ψ such that:
f*(x) = ψ(η(x)), where η(x) = Pr(y = 1 | x).
Minimizing proper composite losses implies probability estimation.
33 Strongly proper composite losses [Agarwal, 2014]
ℓ(y, f) is λ-strongly proper composite if it is proper composite and for any f, x, and conditional distribution of y given x:
E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f*(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².
Technical condition.
34 Strongly proper composite losses: examples

loss          | f*(η) = ψ(η)          | η(f*) = ψ⁻¹(f*)   | λ
squared error | 2η − 1                | (1 + f)/2         | 8
logistic      | log(η/(1 − η))        | e^f / (1 + e^f)   | 4
exponential   | (1/2) log(η/(1 − η))  | 1 / (1 + e^{−2f}) | 4
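For the logistic loss, the strong-properness inequality can be verified numerically on a grid. This is a sanity check, not a proof; λ = 4 works for logistic loss because the conditional logistic regret equals a binary KL divergence, which Pinsker's inequality bounds below by 2(η − ψ⁻¹(f))²:

```python
import math

def cond_regret(eta: float, f: float) -> float:
    """Conditional logistic regret: E_{y|x}[l(y, f)] - E_{y|x}[l(y, f*)],
    where f* = log(eta / (1 - eta)) is the pointwise risk minimizer."""
    def risk(g: float) -> float:
        return eta * math.log1p(math.exp(-g)) + (1 - eta) * math.log1p(math.exp(g))
    return risk(f) - risk(math.log(eta / (1 - eta)))

lam = 4.0  # strong-properness constant for the logistic loss
holds = all(
    cond_regret(eta, f) >= (lam / 2) * (eta - 1 / (1 + math.exp(-f))) ** 2 - 1e-12
    for eta in (0.1, 0.3, 0.5, 0.7, 0.9)
    for f in (-3 + 0.25 * k for k in range(25))
)
```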
35 Main result
Theorem (linear-fractional metrics). If:
Ψ(FP, FN) is linear-fractional, non-increasing in FP and FN,
ℓ is λ-strongly proper composite,
then there exists a threshold θ* such that, for any real-valued function f,
Reg_Ψ(h_{f,θ*}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ).

metric             | C_Ψ
F_β-measure        | (1 + β²) / (β²P)
Jaccard similarity | (J + 1) / P
AM measure         | 1 / (2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS₂).
38 Explanation of the theorem
The maximizer of Ψ is h*(x) = sgn(η(x) − η*) for some η*.
The minimizer of ℓ is f*(x) = ψ(η(x)) for an invertible ψ.
Thresholding f*(x) at θ* = ψ(η*) is equivalent to thresholding η(x) at η* (for logistic loss, η(x) = 1/(1 + e^{−f*(x)})).
The gradients of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f*.
42 The optimal threshold θ*
Ψ = classification accuracy ⟹ θ* = 0 (η* = 1/2).
Ψ = weighted accuracy ⟹ η* = w₋ / (w₊ + w₋).
More complex Ψ ⟹ θ* unknown (depends on Ψ(h*)) ⟹ estimate θ* from validation data.
46 Tuning the threshold
Corollary. Given a real-valued function f and a validation sample of size m, let:
θ̂ = argmax_θ Ψ̂(h_{f,θ}) (Ψ̂ = empirical estimate of Ψ).
Then, under the same assumptions and notation:
Reg_Ψ(h_{f,θ̂}) ≤ C_Ψ √( (2/λ) Reg_ℓ(f) ) + O(1/√m).
Learning a standard binary classifier and tuning the threshold afterwards thus recovers the maximizer of Ψ in the limit.
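The whole two-step procedure fits in a short script. The sketch below is illustrative only, under assumptions of my own choosing: synthetic one-dimensional data, plain gradient descent for the logistic-loss step, and an exhaustive sweep maximizing empirical F₁ on validation data. None of the names come from the paper:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def sample(n):
    """Synthetic data: x ~ N(0, 1), Pr(y = 1 | x) = sigmoid(2x)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [1 if random.random() < sigmoid(2 * x) else -1 for x in xs]
    return xs, ys

def fit_logistic(xs, ys, steps=1000, lr=0.5):
    """Step 1: learn f(x) = w*x + c by minimizing the logistic loss."""
    w = c = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gc = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(-y * (w * x + c))  # derivative factor of log(1 + e^{-yf})
            gw -= y * p * x / n
            gc -= y * p / n
        w, c = w - lr * gw, c - lr * gc
    return w, c

def tune_threshold(scores, ys):
    """Step 2: sweep all cut points, maximizing empirical F1 on validation."""
    best_t, best_f1 = float("-inf"), -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, ys) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, ys) if s > t and y == -1)
        fn = sum(1 for s, y in zip(scores, ys) if s <= t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

xs_tr, ys_tr = sample(300)
xs_va, ys_va = sample(300)
w, c = fit_logistic(xs_tr, ys_tr)
theta, val_f1 = tune_threshold([w * x + c for x in xs_va], ys_va)
```

With enough data, the learned slope approaches the true value 2 and the tuned threshold approaches ψ(η*) for the F-measure's optimal η*, as the corollary predicts.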
48 Multilabel classification
A vector of m labels y = (y₁, ..., y_m) for each x.
Multilabel classifier h(x) = (h₁(x), ..., h_m(x)).
False positive/negative rates for each label:
FP_i(h_i) = Pr(h_i = 1, y_i = −1), FN_i(h_i) = Pr(h_i = −1, y_i = 1).
How can a binary classification metric Ψ be used in the multilabel setting? We extend our bounds to cover micro- and macro-averaging.
50 Micro- and macro-averaging
Macro-averaging (average outside Ψ):
Ψ_macro(h) = (1/m) ∑_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).
Our bound suggests that a separate threshold needs to be tuned for each label.
Micro-averaging (average inside Ψ):
Ψ_micro(h) = Ψ( (1/m) ∑_{i=1}^m FP_i(h_i), (1/m) ∑_{i=1}^m FN_i(h_i) ).
Our bound suggests that all labels share a single threshold.
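The two averaging schemes differ only in where the mean is taken. A small sketch using F₁ in the slide's rate notation (the per-label rates here are made up for illustration):

```python
def f1_rates(FP: float, FN: float, P: float) -> float:
    """F1 in terms of per-label rates FP, FN and prior P (slide notation)."""
    return 2 * (P - FN) / (2 * P - FN + FP)

def macro_f1(rates) -> float:
    """Average OUTSIDE the metric: mean of per-label F1 scores."""
    return sum(f1_rates(*r) for r in rates) / len(rates)

def micro_f1(rates) -> float:
    """Average INSIDE the metric: pool the rates first, then apply F1 once."""
    m = len(rates)
    FP = sum(r[0] for r in rates) / m
    FN = sum(r[1] for r in rates) / m
    P = sum(r[2] for r in rates) / m
    return f1_rates(FP, FN, P)

# A frequent label and a rare one: macro weights the two labels equally,
# while micro is dominated by the pooled rates.
rates = [(0.05, 0.10, 0.30), (0.01, 0.02, 0.05)]
macro, micro = macro_f1(rates), micro_f1(rates)
```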
52 Experiments
53 Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Procedure: minimize ℓ (logistic or hinge) on the training data, then tune the threshold θ̂ on the validation data by optimizing the metric.
54 Synthetic experiment I
X = {1, 2, ..., 25} with Pr(x) uniform. For each x ∈ X, η(x) ~ Unif[0, 1].
(Figure, left: logistic loss surrogate, which converges as expected; logistic regret, F-measure regret, and AM regret vs. the number of training examples.)
(Figure, right: hinge loss surrogate, which does not converge; hinge regret, F-measure regret, and AM regret vs. the number of training examples.)
55 Synthetic experiment II
X = R², x ~ N(0, I). Logistic model: η(x) = (1 + exp(−a₀ − aᵀx))⁻¹.
(Figure, left: convergence of the F-measure; regret of the F-measure and of the surrogate losses for LR and SVM vs. the number of training examples.)
(Figure, right: convergence of the AM measure; regret of the AM measure and of the surrogate losses for LR and SVM vs. the number of training examples.)
56 Benchmark data experiment (a bit of a surprise)

dataset        | #examples | #features
covtype.binary | 581,012   | 54
gisette        | 7,000     | 5,000

(Figure: F-measure and AM measure for LR and SVM vs. the number of training examples, on covtype.binary and gisette.)
Logistic loss behaves as expected, but hinge loss also performs surprisingly well.
57 Multilabel classification
Data sets: scene, yeast, mediamill (differing in the number of labels, training/test examples, and features).
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: F-measure and AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation of algorithms tuned for micro-averaging in terms of macro-averaged metrics, and vice versa.
58 Multilabel classification: F-measure
(Figure: rows are the scene, yeast, and mediamill data sets; columns are macro-F and micro-F. Each panel plots LR and SVM with thresholds tuned for macro-F and for micro-F vs. the number of training examples.)
59 Multilabel classification: AM measure
(Figure: rows are the scene, yeast, and mediamill data sets; columns are macro-AM and micro-AM. Each panel plots LR and SVM with thresholds tuned for macro-AM and for micro-AM vs. the number of training examples.)
60 Open problem
Ψ(h*) − Ψ(h) ≤ const · √( Risk_ℓ(f) − Risk_ℓ(f*) ),
where f* is the minimizer over all functions.
We often optimize within some class F (e.g., linear functions). If f* ∉ F, the right-hand side does not converge to 0.
This can be beneficial for non-proper losses (e.g., hinge loss).
Most often f* ∉ F, and a theory for this case is needed.
62 Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex metrics optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer. Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if the risk minimizer lies outside the class of functions considered?
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationML4NLP Multiclass Classification
ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we
More informationMachine Learning. Ludovic Samper. September 1st, Antidot. Ludovic Samper (Antidot) Machine Learning September 1st, / 77
Machine Learning Ludovic Samper Antidot September 1st, 2015 Ludovic Samper (Antidot) Machine Learning September 1st, 2015 1 / 77 Antidot Software vendor since 1999 Paris, Lyon, Aix-en-Provence 45 employees
More informationLecture 10 February 23
EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationListwise Approach to Learning to Rank Theory and Algorithm
Listwise Approach to Learning to Rank Theory and Algorithm Fen Xia *, Tie-Yan Liu Jue Wang, Wensheng Zhang and Hang Li Microsoft Research Asia Chinese Academy of Sciences document s Learning to Rank for
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationSupport Vector Machines, Kernel SVM
Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationRandomized Smoothing for Stochastic Optimization
Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and
More informationABC-LogitBoost for Multi-Class Classification
Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12
More informationRegret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss
Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1,3, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1
More informationSubgradient Descent. David S. Rosenberg. New York University. February 7, 2018
Subgradient Descent David S. Rosenberg New York University February 7, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 February 7, 2018 1 / 43 Contents 1 Motivation and Review:
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationarxiv: v1 [stat.ml] 8 Feb 2014
F1-Optimal Thresholding in the Multi-Label Setting arxiv:1402.1892v1 [stat.ml] 8 Feb 2014 Zachary Chase Lipton, Charles Elkan, Balakrishnan (Murali) Naryanaswamy {zlipton,elkan,muralib}@cs.ucsd.edu University
More information1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 October 31, 2016 Due: A. November 11, 2016; B. November 22, 2016 A. Boosting 1. Implement
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More informationMethods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }
More informationClassification with Reject Option
Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012 Outline. Introduction.. Classification with reject option. Spirit of the papers BW2008.. Infinite
More informationOn the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost Hamed Masnadi-Shirazi Statistical Visual Computing Laboratory, University of California, San Diego La
More informationPerformance Evaluation
Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,
More informationLinear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining
Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes November 9, 2012 1 / 1 Nearest centroid rule Suppose we break down our data matrix as by the labels yielding (X
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationOn Equivalence Relationships Between Classification and Ranking Algorithms
On Equivalence Relationships Between Classification and Ranking Algorithms On Equivalence Relationships Between Classification and Ranking Algorithms Şeyda Ertekin Cynthia Rudin MIT Sloan School of Management
More informationMachine Learning and Data Mining. Linear classification. Kalev Kask
Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More information