Learning from Corrupted Binary Labels via Class-Probability Estimation


Learning from Corrupted Binary Labels via Class-Probability Estimation
Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson
National ICT Australia and The Australian National University

Learning from binary labels [figure: instances marked + and −]

Learning from noisy labels [figure: some + and − labels flipped]

Learning from positive and unlabelled data [figure: a few + labels; the remaining instances unlabelled, marked ?]

Learning from binary labels
nature → S ~ D^n → learner
Goal: good classification w.r.t. distribution D

Learning from corrupted labels
nature → D → corruptor → D̄ → S ~ D̄^n → learner
Goal: good classification w.r.t. the (unobserved) distribution D

Paper summary
Can we learn a good classifier from corrupted samples?
Prior work: in special cases (with a rich enough model), yes! One can treat the samples as if they were uncorrupted (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ...
This work: a unified treatment via class-probability estimation, with analysis for a general class of corruptions.

Assumed corruption model

Learning from binary labels: distributions
Fix an instance space X (e.g. R^N).
Underlying distribution D over X × {±1}, with constituent components
(P(x), Q(x), p) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])
(M(x), h(x)) = (P[X = x], P[Y = 1 | X = x])

Learning from corrupted binary labels
nature → D → corruptor → D̄ → S ~ D̄^n → learner
Samples come from a corrupted distribution D̄ = (P̄, Q̄, p̄), where
P̄ = (1 − a)·P + a·Q and Q̄ = b·P + (1 − b)·Q,
and p̄ is arbitrary. Here a, b are the noise rates; this is the mutually contaminated distributions model (Scott et al., 2013).
Goal: good classification w.r.t. the (unobserved) clean distribution D.
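The mutually contaminated model is straightforward to simulate. Below is a minimal Python sketch (not from the paper; the toy Gaussian class-conditionals and all names are illustrative) that draws samples whose positives and negatives are the stated mixtures of the clean P and Q:

```python
import random

def sample_corrupted(sample_p, sample_q, a, b, p_bar, n, rng):
    """Draw n (x, y) pairs from the corrupted distribution Dbar.

    Pbar = (1 - a) * P + a * Q   (corrupted positives)
    Qbar = b * P + (1 - b) * Q   (corrupted negatives)
    p_bar is the (arbitrary) corrupted base rate P[Ybar = 1].
    """
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_bar else -1
        if y == 1:
            # corrupted positive: mixture of clean P and clean Q
            x = sample_p(rng) if rng.random() < 1 - a else sample_q(rng)
        else:
            # corrupted negative: mixture of clean P and clean Q
            x = sample_p(rng) if rng.random() < b else sample_q(rng)
        data.append((x, y))
    return data

# Toy clean class-conditionals: P = N(+1, 1), Q = N(-1, 1).
rng = random.Random(0)
sample = sample_corrupted(lambda r: r.gauss(+1, 1), lambda r: r.gauss(-1, 1),
                          a=0.2, b=0.1, p_bar=0.5, n=1000, rng=rng)
```

Each corrupted class is a mixture of the two clean class-conditionals, which is exactly what makes the setting non-trivial: the learner never sees a pure sample from P or Q.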

Special cases
Label noise: labels flipped w.p. r
  a = (1 − p)·r / p̄, b = p·r / (1 − p̄), and p̄ = (1 − 2r)·p + r
  [figure: some + and − labels flipped]
PU learning: observe M instead of Q
  P̄ = 1·P + 0·Q, Q̄ = M = p·P + (1 − p)·Q, so a = 0, b = p, and p̄ is arbitrary
  [figure: a few + labels; remaining instances marked ?]

Corrupted class-probabilities
The structure of the corrupted class-probabilities underpins the analysis.
Proposition. For any D, D̄: h̄(x) = f_{a,b,p̄}(h(x)), where f_{a,b,p̄} is strictly monotone for fixed a, b, p̄.
This follows from Bayes' rule:
h̄(x) / (1 − h̄(x)) = (p̄ / (1 − p̄)) · P̄(x)/Q̄(x) = (p̄ / (1 − p̄)) · [(1 − a)·(P(x)/Q(x)) + a] / [b·(P(x)/Q(x)) + (1 − b)].
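The odds-ratio identity translates directly into code. A small sketch (my own, not the authors'; names are illustrative) maps a clean class-probability h(x) to the corrupted h̄(x), and checks the strict monotonicity the proposition asserts:

```python
def corrupted_prob(h, a, b, p, p_bar):
    """f_{a,b,pbar}: clean class-probability h(x) -> corrupted hbar(x).

    Recovers the likelihood ratio P(x)/Q(x) from the clean odds,
    then applies the corrupted-odds formula.
    """
    lr = ((1 - p) / p) * h / (1 - h)  # P(x)/Q(x) via Bayes' rule
    odds = (p_bar / (1 - p_bar)) * ((1 - a) * lr + a) / (b * lr + (1 - b))
    return odds / (1 + odds)

# Strict monotonicity in h, as the proposition states:
hs = [i / 100 for i in range(1, 100)]
vals = [corrupted_prob(h, a=0.2, b=0.1, p=0.4, p_bar=0.5) for h in hs]
assert all(v1 < v2 for v1, v2 in zip(vals, vals[1:]))
```

Monotonicity is the key property: it means a good estimate of h̄ orders instances exactly as h would.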

Corrupted class-probabilities: special cases
Label noise: h̄(x) = (1 − 2r)·h(x) + r, with r unknown (Natarajan et al., 2013)
PU learning: h̄(x) = p̄·h(x) / (p̄·h(x) + (1 − p̄)·p), with p unknown (Ward et al., 2009)

Roadmap
Exploit the monotone relationship between h and h̄:
nature → D → corruptor → D̄ → class-prob estimator (e.g. kernel logistic regression) → ĥ → classifier

Classification with noise rates

Class-probabilities and classification
Many classification measures are optimised by thresholding: sign(h(x) − t)
  0-1 error → t = 1/2
  Balanced error → t = p
  F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014)
We can relate this to thresholding of h̄.

Corrupted class-probabilities and classification
By the monotone relationship,
h(x) > t  ⟺  h̄(x) > f_{a,b,p̄}(t).
Thresholding h̄ at f_{a,b,p̄}(t) therefore gives optimal classification on D. This can be translated into a regret bound, e.g. for 0-1 loss.
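A quick numerical check of the threshold correction (a sketch under assumed parameter values, not code from the paper): thresholding h̄ at f_{a,b,p̄}(t) reproduces thresholding h at t, because f is strictly monotone.

```python
def f_corrupt(t, a, b, p, p_bar):
    # f_{a,b,pbar}: clean class-probability -> corrupted class-probability
    lr = ((1 - p) / p) * t / (1 - t)
    odds = (p_bar / (1 - p_bar)) * ((1 - a) * lr + a) / (b * lr + (1 - b))
    return odds / (1 + odds)

a, b, p, p_bar, t = 0.2, 0.1, 0.4, 0.5, 0.5
t_bar = f_corrupt(t, a, b, p, p_bar)  # corrected threshold for hbar
# Monotonicity makes the two decision rules agree for every h:
for h in [0.05, 0.3, 0.49, 0.51, 0.7, 0.95]:
    assert (f_corrupt(h, a, b, p, p_bar) > t_bar) == (h > t)
```

In practice h̄ is replaced by an estimate ĥ, and a, b, p̄ by estimates; the next slides ask where those estimates come from.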

Story so far
The classification scheme requires:
  h̄ → class-probability estimation (e.g. kernel logistic regression)
  t → if unknown, an alternate approach applies (see poster)
  a, b, p̄ → can we estimate these?
nature → D → corruptor → D̄ → class-prob estimator → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise estimator supplying â, b̂, p̂.

Estimating noise rates: some bad news
p is strongly non-identifiable: p̄ is allowed to be arbitrary (e.g. PU learning)
a, b are non-identifiable without assumptions (Scott et al., 2013)
Can we estimate a, b under assumptions?

Weak separability assumption
Assume that D is weakly separable:
min_{x∈X} h(x) = 0 and max_{x∈X} h(x) = 1,
i.e. there exist deterministically positive and deterministically negative instances. This is weaker than full separability.
The assumed range of h constrains the observed range of h̄.

Estimating noise rates
Proposition. Pick any weakly separable D. Then, for any D̄,
a = h̄_min·(h̄_max − p̄) / (p̄·(h̄_max − h̄_min)) and b = (1 − h̄_max)·(p̄ − h̄_min) / ((1 − p̄)·(h̄_max − h̄_min)),
where h̄_min = min_{x∈X} h̄(x) and h̄_max = max_{x∈X} h̄(x).
Thus a, b can be estimated from corrupted data alone.
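The proposition is a two-line function. The sketch below (illustrative names, not the authors' code) recovers a, b from the extremes of the corrupted class-probability, and the demo checks consistency with the corruption model by computing those extremes in the h = 0 and h = 1 limits:

```python
def estimate_noise_rates(h_min, h_max, p_bar):
    """Recover the noise rates a, b from the observed range [h_min, h_max]
    of corrupted class-probabilities, assuming D is weakly separable."""
    denom = h_max - h_min
    a = h_min * (h_max - p_bar) / (p_bar * denom)
    b = (1 - h_max) * (p_bar - h_min) / ((1 - p_bar) * denom)
    return a, b

# Consistency check: extremes of hbar implied by a = 0.2, b = 0.1, p_bar = 0.5,
# via the corrupted-odds formula at h = 0 (likelihood ratio 0) and h = 1 (ratio -> inf).
a_true, b_true, p_bar = 0.2, 0.1, 0.5
odds0 = (p_bar / (1 - p_bar)) * a_true / (1 - b_true)
odds1 = (p_bar / (1 - p_bar)) * (1 - a_true) / b_true
h_min, h_max = odds0 / (1 + odds0), odds1 / (1 + odds1)
a_hat, b_hat = estimate_noise_rates(h_min, h_max, p_bar)
assert abs(a_hat - a_true) < 1e-9 and abs(b_hat - b_true) < 1e-9
```

With finite data, h̄_min and h̄_max are replaced by the extremes of the estimate ĥ over the sample, which is where estimation error enters.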

Estimating noise rates: special cases
Label noise: r = h̄_min = 1 − h̄_max and p = (p̄ − h̄_min) / (h̄_max − h̄_min)
PU learning: a = 0 and b = p = p̄·(1 − h̄_max) / ((1 − p̄)·h̄_max)
(Elkan and Noto, 2008), (Liu and Tao, 2014); c.f. the mixture proportion estimate of (Scott et al., 2013)
In these cases, p can be estimated as well.
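For the label-noise special case the recovery is especially simple; a hypothetical helper (my naming, not the paper's) under symmetric flips with probability r:

```python
def label_noise_params(h_min, h_max, p_bar):
    """Under symmetric label noise, hbar = (1 - 2r) h + r, so the range of
    hbar is [r, 1 - r]; r and the clean base rate p follow directly."""
    r = h_min  # equivalently, 1 - h_max
    p = (p_bar - h_min) / (h_max - h_min)
    return r, p

# Forward model with r = 0.2, p = 0.4: range [0.2, 0.8], p_bar = (1 - 2r) p + r = 0.44
r_hat, p_hat = label_noise_params(h_min=0.2, h_max=0.8, p_bar=0.44)
assert abs(r_hat - 0.2) < 1e-9 and abs(p_hat - 0.4) < 1e-9
```

Note that here the clean base rate p is identifiable, unlike in the general corruption model.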

Story so far
Optimal classification in general requires a, b, p̄; these can be estimated from the range of ĥ.
nature → D → corruptor → D̄ → class-prob estimator → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with â, b̂, p̂ from the range of ĥ.
When does f_{a,b,p̄}(t) not depend on a, b, p̄?

Classification without noise rates

Balanced error (BER) of a classifier
The balanced error of a classifier f : X → {±1} is
BER^D(f) = (FPR^D(f) + FNR^D(f)) / 2
for false positive and false negative rates FPR^D(f), FNR^D(f).
This averages classification performance over the two classes; the optimal classifier is sign(h(x) − p).
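The balanced error is easy to compute from predictions; a short sketch (toy data; the function name is mine):

```python
def balanced_error(preds, labels):
    """BER = (FPR + FNR) / 2, i.e. the mean per-class error rate."""
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == -1)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == -1 and y == 1)
    n_neg = sum(1 for y in labels if y == -1)
    n_pos = sum(1 for y in labels if y == 1)
    return 0.5 * (fp / n_neg + fn / n_pos)

labels = [1, 1, 1, -1, -1]
preds = [1, -1, 1, -1, 1]
ber = balanced_error(preds, labels)  # 0.5 * (1/2 + 1/3)
```

Unlike 0-1 error, BER weights both classes equally regardless of the base rate, which is what makes it special under corruption.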

BER immunity under corruption
Proposition (c.f. Zhang and Lee, 2008). For any D, D̄ and classifier f : X → {±1},
BER^D̄(f) = (1 − a − b)·BER^D(f) + (a + b)/2.
So BER-optimal classifiers on the clean and corrupted distributions coincide:
sign(h(x) − p) = sign(h̄(x) − p̄).
Minimising the clean BER therefore does not require knowing the corruption rates: the threshold on h̄ does not need a, b, p̄.
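The affine relationship can be verified exactly on a toy finite instance space (a sketch with made-up distributions, not the paper's code):

```python
# Clean class-conditionals on a toy finite instance space {0, 1, 2}.
P = {0: 0.7, 1: 0.2, 2: 0.1}  # clean positives
Q = {0: 0.1, 1: 0.2, 2: 0.7}  # clean negatives
a, b = 0.3, 0.2               # assumed noise rates
P_bar = {x: (1 - a) * P[x] + a * Q[x] for x in P}
Q_bar = {x: b * P[x] + (1 - b) * Q[x] for x in Q}

def ber(pos, neg, f):
    """Population BER of classifier f under class-conditionals pos, neg."""
    fnr = sum(pr for x, pr in pos.items() if f(x) == -1)
    fpr = sum(pr for x, pr in neg.items() if f(x) == 1)
    return 0.5 * (fpr + fnr)

f = lambda x: 1 if x <= 1 else -1  # an arbitrary classifier
clean, corrupted = ber(P, Q, f), ber(P_bar, Q_bar, f)
# The corrupted BER is an affine function of the clean BER, for any f.
assert abs(corrupted - ((1 - a - b) * clean + (a + b) / 2)) < 1e-12
```

Because the affine map is the same for every classifier f, minimising BER on corrupted data minimises it on clean data.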

BER immunity & class-probability estimation
Trivially, we also have
regret^D_BER(f) = (1 − a − b)^(−1) · regret^D̄_BER(f),
i.e. a good corrupted BER implies a good clean BER, and regret^D̄_BER(f) → 0 is achievable by class-probability estimation.
A similar result holds for AUC (see poster).

BER immunity under corruption: proof
From (Scott et al., 2013),
[FPR^D̄(f)  FNR^D̄(f)] = [FPR^D(f)  FNR^D(f)] · [[1−b, −a], [−b, 1−a]] + [b, a],
and (1, 1)^T is an eigenvector of [[1−b, −a], [−b, 1−a]] with eigenvalue 1 − a − b; right-multiplying by (1/2)·(1, 1)^T yields the result.

Are other measures immune?
BER is the only (non-trivial) performance measure for which:
  the corrupted risk is an affine transform of the clean risk (because of the eigenvector interpretation)
  the corrupted threshold is independent of a, b, p̄ (because of the nature of f_{a,b,p̄}; see poster)
Other performance measures need (at least one of) a, b, p̄.

Experiments

Experimental setup
Injected label noise on UCI datasets.
Corrupted class-probabilities estimated via a neural network; this is well-specified if D is linearly separable:
h(x) = σ(⟨w, x⟩)  ⟹  h̄(x) = a·σ(⟨w, x⟩) + b.
Evaluated:
  reliability of the noise estimates
  BER performance on a clean test set (corrupted data used for training and validation)
  0-1 performance on a clean test set (see poster)
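The noise-injection step can be sketched as follows (a hypothetical helper, not the experimental code; (r+, r−) are the class-conditional flip probabilities used in the results tables):

```python
import random

def inject_label_noise(labels, r_pos, r_neg, rng):
    """Flip each +1 label w.p. r_pos and each -1 label w.p. r_neg."""
    return [-y if rng.random() < (r_pos if y == 1 else r_neg) else y
            for y in labels]

rng = random.Random(1)
clean = [1] * 1000 + [-1] * 1000
noisy = inject_label_noise(clean, r_pos=0.1, r_neg=0.2, rng=rng)
```

Only the training and validation labels are corrupted; the test set stays clean so that degradation can be measured against ground truth.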

Experimental results: noise rates
Estimated noise rates are generally reliable.
[Figure: bias of the noise estimate (mean and median) vs. ground-truth noise in [0, 0.49], on segment, mnist, and spambase]

Experimental results: BER immunity
Generally, low observed degradation in BER.

Dataset  | Noise                 | 1 − AUC (%)  | BER (%)
segment  | None                  | 0.00 ± 0.00  | 0.00 ± 0.00
segment  | (r+, r−) = (0.1, 0.0) | 0.00 ± 0.00  | 0.01 ± 0.00
segment  | (r+, r−) = (0.1, 0.2) | 0.02 ± 0.01  | 0.90 ± 0.08
segment  | (r+, r−) = (0.2, 0.4) | 0.03 ± 0.01  | 3.24 ± 0.20
spambase | None                  | 2.49 ± 0.00  | 6.93 ± 0.00
spambase | (r+, r−) = (0.1, 0.0) | 2.67 ± 0.02  | 7.10 ± 0.03
spambase | (r+, r−) = (0.1, 0.2) | 3.01 ± 0.03  | 7.66 ± 0.05
spambase | (r+, r−) = (0.2, 0.4) | 4.91 ± 0.09  | 10.52 ± 0.13
mnist    | None                  | 0.92 ± 0.00  | 3.63 ± 0.00
mnist    | (r+, r−) = (0.1, 0.0) | 0.95 ± 0.01  | 3.56 ± 0.01
mnist    | (r+, r−) = (0.1, 0.2) | 0.97 ± 0.01  | 3.63 ± 0.02
mnist    | (r+, r−) = (0.2, 0.4) | 1.17 ± 0.02  | 4.06 ± 0.03

Conclusion

Learning from corrupted binary labels
The monotone relationship h̄(x) = f_{a,b,p̄}(h(x)) facilitates the full pipeline:
nature → D → corruptor → D̄ → class-prob estimator (e.g. kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)),
with â, b̂, p̂ from a noise estimator based on the range of ĥ (and the noise estimator can be omitted for BER).

Future work
Better noise estimators in special cases? c.f. (Elkan and Noto, 2008) when D is separable.
Fusion with the loss-transfer approach of (Natarajan et al., 2013), which assumes the noise rates are known: is it better for misspecified models? c.f. the non-robustness of convex surrogate minimisation.

Thanks! Drop by the poster for more (Paper ID 69).