Learning from Corrupted Binary Labels via Class-Probability Estimation


1 Learning from Corrupted Binary Labels via Class-Probability Estimation. Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson. National ICT Australia and The Australian National University

2 Learning from binary labels. [Figure: instances labelled positive (+) and negative (−).]

3 Learning from binary labels. [Figure: labelled instances together with a new, unlabelled instance (?) to classify.]

4 Learning from binary labels. [Figure: instances labelled positive (+) and negative (−).]

5 Learning from noisy labels. [Figure: labelled instances, some with flipped labels.]

6 Learning from positive and unlabelled data. [Figure: a few positive (+) instances among unlabelled (?) instances.]

7 Learning from binary labels. [Pipeline: nature → D, learner receives S ~ D^n.] Goal: good classification wrt distribution D

8 Learning from corrupted labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Goal: good classification wrt (unobserved) distribution D

9 Paper summary. Can we learn a good classifier from corrupted samples?

10 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes!

11 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes! Can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ...

12 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes! Can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ... This work: unified treatment via class-probability estimation; analysis for a general class of corruptions

13 Assumed corruption model

14 Learning from binary labels: distributions. Fix instance space X (e.g. R^N). Underlying distribution D over X × {±1}. Constituent components of D: (P(x), Q(x), p) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])

15 Learning from binary labels: distributions. Fix instance space X (e.g. R^N). Underlying distribution D over X × {±1}. Constituent components of D: (P(x), Q(x), p) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1]); equivalently, (M(x), h(x)) = (P[X = x], P[Y = 1 | X = x])

16 Learning from corrupted binary labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Samples from corrupted distribution D̄ = (P̄, Q̄, p̄). Goal: good classification wrt (unobserved) distribution D

17 Learning from corrupted binary labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Samples from corrupted distribution D̄ = (P̄, Q̄, p̄), where P̄ = (1 − a)·P + a·Q, Q̄ = b·P + (1 − b)·Q, and p̄ is arbitrary; a, b are the noise rates → mutually contaminated distributions (Scott et al., 2013). Goal: good classification wrt (unobserved) distribution D

18 Special cases. Label noise: labels flipped w.p. r; p̄ = (1 − 2r)·p + r, a = (1 − p)·r / p̄, b = p·r / (1 − p̄). PU learning: observe M instead of Q; p̄ arbitrary, P̄ = 1·P + 0·Q, Q̄ = M = p·P + (1 − p)·Q
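To make the two special cases concrete, here is a minimal NumPy sketch (not from the slides; the helper names inject_label_noise and make_pu_sample are illustrative) of how a corruptor could generate samples matching each case: symmetric label noise flips each label w.p. r, while case-control PU learning draws "positives" from P and "unlabelled" points from the marginal M = p·P + (1 − p)·Q, so that a = 0 and b = p.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_label_noise(y, r):
    """Symmetric label noise: flip each label in {-1, +1} independently w.p. r."""
    flips = rng.random(len(y)) < r
    return np.where(flips, -y, y)

def make_pu_sample(X, y, n_pos, n_unl):
    """Case-control PU data: 'positives' drawn from P (true positives, labelled +1),
    'negatives' drawn from the marginal M = p*P + (1-p)*Q (approximated here by
    resampling the whole clean sample) and labelled -1. Hence a = 0, b = p."""
    pos_idx = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=True)
    unl_idx = rng.choice(len(y), size=n_unl, replace=True)
    X_corr = np.vstack([X[pos_idx], X[unl_idx]])
    y_corr = np.concatenate([np.ones(n_pos), -np.ones(n_unl)])
    return X_corr, y_corr
```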

19 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis

20 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p.

21 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p. Follows from Bayes' rule: h̄(x) / (1 − h̄(x)) = (p̄ / (1 − p̄)) · P̄(x) / Q̄(x)

22 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p. Follows from Bayes' rule: h̄(x) / (1 − h̄(x)) = (p̄ / (1 − p̄)) · P̄(x) / Q̄(x) = (p̄ / (1 − p̄)) · ((1 − a)·P(x)/Q(x) + a) / (b·P(x)/Q(x) + (1 − b))
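As a concrete illustration (a minimal sketch, not code from the paper), the odds identity above can be turned into an explicit map from h(x) to h̄(x); written this way the map needs the noise rates a, b, the clean base rate p and the corrupted base rate p̄. The final assertion checks it against the label-noise special case on the next slide.

```python
def corrupted_class_prob(eta, a, b, p, p_corr):
    """Map the clean class-probability eta = h(x) to the corrupted one h_bar(x),
    for noise rates (a, b), clean base rate p and corrupted base rate p_corr.
    Uses the odds identity above, with P(x)/Q(x) = (eta/(1-eta)) * (1-p)/p."""
    R = (eta / (1.0 - eta)) * ((1.0 - p) / p)            # likelihood ratio P(x)/Q(x)
    odds = (p_corr / (1.0 - p_corr)) * ((1.0 - a) * R + a) / (b * R + (1.0 - b))
    return odds / (1.0 + odds)

# Sanity check against the label-noise special case: with flip probability r,
# p_corr = (1-2r)p + r, a = (1-p)r/p_corr, b = pr/(1-p_corr), and the map
# should reduce to (1-2r)*eta + r.
p, r, eta = 0.3, 0.2, 0.7
p_corr = (1 - 2 * r) * p + r
a, b = (1 - p) * r / p_corr, p * r / (1 - p_corr)
assert abs(corrupted_class_prob(eta, a, b, p, p_corr) - ((1 - 2 * r) * eta + r)) < 1e-12
```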

23 Corrupted class-probabilities: special cases. Label noise: h̄(x) = (1 − 2r)·h(x) + r, with r unknown (Natarajan et al., 2013). PU learning: h̄(x) = p̄·h(x) / (p̄·h(x) + (1 − p̄)·p), with p unknown (Ward et al., 2009)

24 Roadmap. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier.]

25 Roadmap. Exploit the monotone relationship between h̄ and h. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier?]

26 Classification with noise rates

27 Class-probabilities and classification. Many classification measures are optimised by sign(h(x) − t): 0-1 error → t = 1/2; balanced error → t = p; F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014)

28 Class-probabilities and classification. Many classification measures are optimised by sign(h(x) − t): 0-1 error → t = 1/2; balanced error → t = p; F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014). We can relate this to thresholding of h̄!

29 Corrupted class-probabilities and classification. By the monotone relationship, h(x) > t ⟺ h̄(x) > f_{a,b,p}(t). Threshold h̄ at f_{a,b,p}(t) → optimal classification on D. Can translate into a regret bound, e.g. for 0-1 loss
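A hypothetical sketch of the resulting classification rule (names such as corrected_threshold are illustrative, and in practice a, b, p, p̄ would themselves be estimates): push the clean threshold t through f_{a,b,p}, then threshold the estimated corrupted class-probabilities at that value.

```python
import numpy as np

def corrected_threshold(t, a, b, p, p_corr):
    """f_{a,b,p}(t): the corrupted-space threshold corresponding to clean threshold t."""
    R = (t / (1.0 - t)) * ((1.0 - p) / p)       # likelihood ratio P(x)/Q(x) at h(x) = t
    odds = (p_corr / (1.0 - p_corr)) * ((1.0 - a) * R + a) / (b * R + (1.0 - b))
    return odds / (1.0 + odds)

def classify(eta_corr_hat, t, a, b, p, p_corr):
    """Threshold estimated corrupted class-probabilities to mimic sign(h(x) - t)."""
    return np.where(eta_corr_hat > corrected_threshold(t, a, b, p, p_corr), 1, -1)
```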

30 Story so far. Classification scheme requires: h̄; t; a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

31 Story so far. Classification scheme requires: h̄ → class-probability estimation; t; a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

32 Story so far. Classification scheme requires: h̄ → class-probability estimation; t → if unknown, alternate approach (see poster); a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

33 Story so far. Classification scheme requires: h̄ → class-probability estimation; t → if unknown, alternate approach (see poster); a, b, p → can we estimate these? [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise estimator (?) supplying â, b̂, p̂.]

34 Estimating noise rates: some bad news. p is strongly non-identifiable: p̄ is allowed to be arbitrary (e.g. PU learning). a, b are non-identifiable without assumptions (Scott et al., 2013). Can we estimate a, b under assumptions?

35 Weak separability assumption. Assume that D is weakly separable: min_{x∈X} h(x) = 0 and max_{x∈X} h(x) = 1, i.e. ∃ deterministically +ve and −ve instances; weaker than full separability

36 Weak separability assumption. Assume that D is weakly separable: min_{x∈X} h(x) = 0 and max_{x∈X} h(x) = 1, i.e. ∃ deterministically +ve and −ve instances; weaker than full separability. Assumed range of h constrains the observed range of h̄!

37 Estimating noise rates. Proposition: pick any weakly separable D. Then, for any D̄, a = h̄_min·(h̄_max − p̄) / (p̄·(h̄_max − h̄_min)) and b = (1 − h̄_max)·(p̄ − h̄_min) / ((1 − p̄)·(h̄_max − h̄_min)), where h̄_min = min_{x∈X} h̄(x) and h̄_max = max_{x∈X} h̄(x). So a, b can be estimated from corrupted data alone
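A plug-in version of the proposition (a sketch under the weak-separability assumption; in practice h̄_min and h̄_max would be replaced by the minimum and maximum, or suitable quantiles, of the estimated corrupted class-probabilities on a sample):

```python
def estimate_noise_rates(h_min, h_max, p_corr):
    """Plug-in estimates of (a, b) from the range of the corrupted class-probability
    and the corrupted base rate p_corr, assuming D is weakly separable."""
    spread = h_max - h_min
    a = h_min * (h_max - p_corr) / (p_corr * spread)
    b = (1.0 - h_max) * (p_corr - h_min) / ((1.0 - p_corr) * spread)
    return a, b

# e.g. with probs = estimated corrupted class-probabilities on the corrupted sample:
# a_hat, b_hat = estimate_noise_rates(probs.min(), probs.max(), (y_corr == 1).mean())
```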

38 Estimating noise rates: special cases. Label noise: r = h̄_min = 1 − h̄_max, p = (p̄ − h̄_min) / (h̄_max − h̄_min). PU learning: a = 0, b = p = p̄·(1 − h̄_max) / ((1 − p̄)·h̄_max) (Elkan and Noto, 2008), (Liu and Tao, 2014); c.f. the mixture proportion estimate of (Scott et al., 2013). In these cases, p can be estimated as well
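The special cases reduce to one-line estimators (again a sketch; the function names are illustrative, not from the paper):

```python
def label_noise_rate(h_min, h_max):
    """Symmetric label noise: r = h_bar_min = 1 - h_bar_max; average the two plug-in estimates."""
    return 0.5 * (h_min + (1.0 - h_max))

def label_noise_clean_prior(h_min, h_max, p_corr):
    """Symmetric label noise: p = (p_bar - h_bar_min) / (h_bar_max - h_bar_min)."""
    return (p_corr - h_min) / (h_max - h_min)

def pu_positive_prior(h_max, p_corr):
    """Case-control PU learning: p = p_bar * (1 - h_bar_max) / ((1 - p_bar) * h_bar_max)."""
    return p_corr * (1.0 - h_max) / ((1.0 - p_corr) * h_max)
```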

39 Story so far. Optimal classification in general requires a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂.]

40 Story so far. Optimal classification in general requires a, b, p; when does f_{a,b,p}(t) not depend on a, b, p? [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂.]

41 Classification without noise rates

42 Balanced error (BER) of a classifier. The balanced error (BER) of a classifier f : X → {±1} is BER_D(f) = (FPR_D(f) + FNR_D(f)) / 2, for false positive and false negative rates FPR_D(f), FNR_D(f): average classification performance on each class; the optimal classifier is sign(h(x) − p)

43 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2

44 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2. BER-optimal classifiers on the clean and corrupted distributions coincide: sign(h(x) − p) = sign(h̄(x) − p̄)

45 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2. BER-optimal classifiers on the clean and corrupted distributions coincide: sign(h(x) − p) = sign(h̄(x) − p̄). Minimise clean BER → don't need to know corruption rates! The threshold on h̄ does not need a, b, p
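A quick numerical check of the proposition (a standalone sketch, not from the paper): express a classifier's corrupted error rates in terms of its clean ones under mutual contamination, and confirm the affine relation.

```python
import numpy as np

def ber(fpr, fnr):
    return 0.5 * (fpr + fnr)

def corrupted_rates(fpr, fnr, a, b):
    """Under P_bar = (1-a)P + aQ, Q_bar = bP + (1-b)Q:
       FPR_bar = (1-b)*FPR + b*(1-FNR),  FNR_bar = (1-a)*FNR + a*(1-FPR)."""
    return (1 - b) * fpr + b * (1 - fnr), (1 - a) * fnr + a * (1 - fpr)

# Check BER_bar = (1-a-b)*BER + (a+b)/2 on random inputs.
rng = np.random.default_rng(0)
for _ in range(5):
    fpr, fnr, a, b = rng.uniform(0, 0.5, size=4)
    fpr_c, fnr_c = corrupted_rates(fpr, fnr, a, b)
    assert np.isclose(ber(fpr_c, fnr_c), (1 - a - b) * ber(fpr, fnr) + (a + b) / 2)
```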

46 BER immunity & class-probability estimation. Trivially, we also have regret^D_BER(f) = (1 − a − b)^{-1} · regret^D̄_BER(f), i.e. good corrupted BER ⟹ good clean BER; can make regret^D̄_BER(f) → 0 by class-probability estimation. Similar result for AUC (see poster)

47 BER immunity under corruption: proof. From (Scott et al., 2013), [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [[1 − b, −a], [−b, 1 − a]] + [b, a],

48 BER immunity under corruption: proof. From (Scott et al., 2013), [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [[1 − b, −a], [−b, 1 − a]] + [b, a], and [1, 1]^T is an eigenvector of [[1 − b, −a], [−b, 1 − a]] (with eigenvalue 1 − a − b)
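The eigenvector claim is easy to verify numerically (the values of a and b below are arbitrary, chosen only for illustration):

```python
import numpy as np

a, b = 0.2, 0.35
A = np.array([[1 - b, -a],
              [-b, 1 - a]])
v = np.ones(2)
# [1, 1]^T is an eigenvector with eigenvalue 1 - a - b, so summing the two corrupted
# error rates (i.e. forming the BER) gives an affine transform of the clean BER.
assert np.allclose(A @ v, (1 - a - b) * v)
```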

49 Are other measures immune? BER is the only (non-trivial) performance measure for which: corrupted risk = affine transform of clean risk (because of the eigenvector interpretation); corrupted threshold is independent of a, b, p (because of the nature of f_{a,b,p}; see poster). Other performance measures → need (one of) a, b, p

50 Experiments

51 Experimental setup. Injected label noise on UCI datasets. Estimate corrupted class-probabilities via a neural network; well-specified if D is linearly separable: h(x) = σ(⟨w, x⟩) ⟹ h̄(x) = a·σ(⟨w, x⟩) + b. Evaluate: reliability of noise estimates; BER performance on a clean test set (corrupted data used for training and validation); 0-1 performance on a clean test set (see poster)
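An end-to-end miniature of this setup (a sketch only: synthetic data from scikit-learn's make_classification stands in for the UCI datasets, and plain logistic regression stands in for the neural network class-probability estimator): inject asymmetric label noise, fit on the corrupted labels, and exploit BER immunity by thresholding the corrupted probabilities at the corrupted base rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Clean data with labels in {-1, +1}, then injected asymmetric label noise (r_pos, r_neg).
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
y = 2 * y - 1
r_pos, r_neg = 0.1, 0.2
flip_prob = np.where(y == 1, r_pos, r_neg)
y_corr = np.where(rng.random(len(y)) < flip_prob, -y, y)

# Corrupted class-probability estimator trained on corrupted labels only.
clf = LogisticRegression(max_iter=1000).fit(X, y_corr)
probs = clf.predict_proba(X)[:, 1]

# BER-immune prediction: threshold the corrupted probabilities at the corrupted base
# rate -- no knowledge of (r_pos, r_neg) is needed.
p_corr = (y_corr == 1).mean()
y_pred = np.where(probs > p_corr, 1, -1)

# Clean BER (for brevity evaluated on the same sample; the slides use a held-out clean test set).
fpr = np.mean(y_pred[y == -1] == 1)
fnr = np.mean(y_pred[y == 1] == -1)
print("clean BER:", 0.5 * (fpr + fnr))
```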

52 Experimental results: noise rates. Estimated noise rates are generally reliable. [Figure: bias of the noise-rate estimates (mean and median) versus ground-truth noise, on segment, spambase and mnist.]

53 Experimental results: BER immunity. Generally, low observed degradation in BER. [Table: 1 − AUC (%) and BER (%) on the clean test set for segment, spambase and mnist, under noise levels (r+, r−) ∈ {none, (0.1, 0.0), (0.1, 0.2), (0.2, 0.4)}; degradation relative to the noise-free case is small, e.g. on spambase the mean 1 − AUC rises from 2.49% with no noise to 4.91% at (0.2, 0.4).]

54 Conclusion

55 Learning from corrupted binary labels. The monotone relationship h̄(x) = f_{a,b,p}(h(x)) facilitates: [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂ (omitted for BER).]

56 Future work. Better noise estimators in special cases? c.f. (Elkan and Noto, 2008) when D is separable. Fusion with the loss-transfer approach of (Natarajan et al., 2013), which assumes the noise rates are known: better for misspecified models? c.f. non-robustness of convex surrogate minimisation

57 Thanks! Drop by the poster for more (Paper ID 69)
