Learning from Corrupted Binary Labels via Class-Probability Estimation
- Arron Miles
- 5 years ago
Transcription
1 Learning from Corrupted Binary Labels via Class-Probability Estimation Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson. National ICT Australia and The Australian National University 1 / 57
2 Learning from binary labels [Figure: instances with + and − labels] 2 / 57
3 Learning from binary labels [Figure: labelled instances and one instance with unknown label (?)] 3 / 57
4 Learning from binary labels [Figure: instances with + and − labels] 4 / 57
5 Learning from noisy labels [Figure: instances with noisy + and − labels] 5 / 57
6 Learning from positive and unlabelled data [Figure: instances that are either labelled + or unlabelled (?)] 6 / 57
7 Learning from binary labels [Figure: instances with + and − labels] nature → $S \sim D^n$ → learner. Goal: good classification wrt distribution $D$ 7 / 57
8 Learning from corrupted labels [Figure: instances with corrupted + and − labels] nature → $S \sim D^n$ → corruptor → $\bar S \sim \bar D^n$ → learner. Goal: good classification wrt (unobserved) distribution $D$ 8 / 57
12 Paper summary Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes! We can treat samples as if uncorrupted (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ... This work: unified treatment via class-probability estimation; analysis for a general class of corruptions 12 / 57
13 Assumed corruption model 13 / 57
15 Learning from binary labels: distributions Fix instance space $X$ (e.g. $\mathbb{R}^N$). Underlying distribution $D$ over $X \times \{\pm 1\}$. Constituent components of $D$: $(P(x), Q(x), p) = (\mathbb{P}[X = x \mid Y = 1], \mathbb{P}[X = x \mid Y = -1], \mathbb{P}[Y = 1])$ and $(M(x), h(x)) = (\mathbb{P}[X = x], \mathbb{P}[Y = 1 \mid X = x])$ 15 / 57
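The two parameterisations describe the same distribution. As a short worked relation (standard Bayes' rule, not spelled out on the slide), the marginal and the class-probability follow from the class-conditionals and the base rate:

```latex
M(x) = p \cdot P(x) + (1 - p) \cdot Q(x),
\qquad
h(x) = \frac{p \cdot P(x)}{M(x)},
\qquad
\frac{h(x)}{1 - h(x)} = \frac{p}{1 - p} \cdot \frac{P(x)}{Q(x)}.
```

The last identity is the clean-data form of the odds relation that the later slides apply to the corrupted distribution.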
17 Learning from corrupted binary labels nature → $S \sim D^n$ → corruptor → $\bar S \sim \bar D^n$ → learner. Samples from corrupted distribution $\bar D = (\bar P, \bar Q, \bar p)$, where $\bar P = (1 - a) \cdot P + a \cdot Q$ and $\bar Q = b \cdot P + (1 - b) \cdot Q$ are mutually contaminated distributions (Scott et al., 2013), $a, b$ are noise rates, and $\bar p$ is arbitrary. Goal: good classification wrt (unobserved) distribution $D$ 17 / 57
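18 Special cases Label noise: labels flipped w.p. $r$, so $\bar p = (1 - 2r) \cdot p + r$, $a = \bar p^{-1} \cdot (1 - p) \cdot r$ and $b = (1 - \bar p)^{-1} \cdot p \cdot r$. PU learning: observe $M$ instead of $Q$, so $\bar p$ is arbitrary, $a = 0$ and $b = p$, i.e. $\bar P = 1 \cdot P + 0 \cdot Q$ and $\bar Q = M = p \cdot P + (1 - p) \cdot Q$. [Figures: noisily labelled instances; positive and unlabelled instances] 18 / 57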
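As a hedged illustration of the two special cases, the sketch below computes the contamination parameters implied by symmetric label noise and by PU learning. The function names `label_noise_params` and `pu_params` are hypothetical; `p` is the clean base rate, `p_bar` the corrupted base rate, and `r` the flip probability.

```python
def label_noise_params(p, r):
    """Symmetric label noise: every label is flipped with probability r."""
    p_bar = (1 - 2 * r) * p + r      # corrupted base rate
    a = (1 - p) * r / p_bar          # weight of Q inside the corrupted positives
    b = p * r / (1 - p_bar)          # weight of P inside the corrupted negatives
    return a, b, p_bar


def pu_params(p, p_bar):
    """PU learning: positives are clean (a = 0), 'negatives' are draws from M (b = p)."""
    return 0.0, p, p_bar


# Example values chosen purely for illustration.
a, b, p_bar = label_noise_params(p=0.3, r=0.2)
print(a, b, p_bar)
print(pu_params(p=0.3, p_bar=0.5))
```

Both helpers only restate the formulas on the slide; they make no attempt to estimate `p` or `r` from data.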
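22 Corrupted class-probabilities Structure of corrupted class-probabilities underpins analysis. Proposition: for any $D, \bar D$, $\bar h(x) = f_{a,b,p}(h(x))$, where $f_{a,b,p}$ is strictly monotone for fixed $a, b, p$. Follows from Bayes' rule: $\frac{\bar h(x)}{1 - \bar h(x)} = \frac{\bar p}{1 - \bar p} \cdot \frac{\bar P(x)}{\bar Q(x)} = \frac{\bar p}{1 - \bar p} \cdot \frac{(1 - a) \cdot \frac{P(x)}{Q(x)} + a}{b \cdot \frac{P(x)}{Q(x)} + (1 - b)}$ 22 / 57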
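A minimal sketch of the map from clean to corrupted class-probabilities may help make the proposition concrete. It assumes $a + b < 1$ and recovers the density ratio $P(x)/Q(x)$ from $h(x)$ and the clean base rate $p$ via Bayes' rule; the function name `corrupted_class_prob` and the argument `p_bar` (the corrupted base rate) are illustrative choices, not names from the paper.

```python
def corrupted_class_prob(h, a, b, p, p_bar):
    """Map a clean class-probability h = P[Y=1 | x] to its corrupted counterpart."""
    inf = float("inf")
    if h >= 1.0:
        # P(x)/Q(x) is infinite here, so use the limit of the odds formula.
        odds_bar = inf if b == 0 else p_bar / (1 - p_bar) * (1 - a) / b
    else:
        ratio = (h / (1 - h)) * ((1 - p) / p)   # P(x)/Q(x) on the clean distribution
        odds_bar = (p_bar / (1 - p_bar)
                    * ((1 - a) * ratio + a) / (b * ratio + (1 - b)))
    return 1.0 if odds_bar == inf else odds_bar / (1 + odds_bar)


# Evaluate the map on a grid of clean probabilities (illustrative parameters).
grid = [i / 10 for i in range(11)]
values = [corrupted_class_prob(h, a=0.2, b=0.1, p=0.4, p_bar=0.5) for h in grid]
print(values)
```

With the label-noise parameters from the special-cases slide, the same function reduces to the affine map $(1 - 2r) \cdot h + r$, which is a convenient sanity check.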
23 Corrupted class-probabilities: special cases Label noise: $\bar h(x) = (1 - 2r) \cdot h(x) + r$, with $r$ unknown (Natarajan et al., 2013). PU learning: $\bar h(x) = \frac{\bar p \cdot h(x)}{\bar p \cdot h(x) + (1 - \bar p) \cdot p}$, with $p$ unknown (Ward et al., 2009) 23 / 57
25 Roadmap Exploit monotone relationship between $h$ and $\bar h$. nature → $D$ → corruptor → $\bar D$ → class-prob estimator (kernel logistic regression) → $\hat{\bar h}$ → classifier? 25 / 57
26 Classification with noise rates 26 / 57
28 Class-probabilities and classification Many classification measures are optimised by $\mathrm{sign}(h(x) - t)$: 0-1 error → $t = \frac{1}{2}$; balanced error → $t = p$; F-score → optimal $t$ depends on $D$ (Lipton et al., 2014; Koyejo et al., 2014). We can relate this to thresholding of $\bar h$! 28 / 57
29 Corrupted class-probabilities and classification By the monotone relationship, $h(x) > t \iff \bar h(x) > f_{a,b,p}(t)$. Threshold $\bar h$ at $f_{a,b,p}(t)$ → optimal classification on $D$. Can translate into a regret bound, e.g. for 0-1 loss 29 / 57
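The threshold transfer can be sketched in a few lines. The example below uses the label-noise instance of the monotone map, $(1 - 2r) \cdot h + r$, as a stand-in for $f_{a,b,p}$; the helper name `classify_from_corrupted` is hypothetical, and any strictly increasing map could be passed in its place.

```python
def classify_from_corrupted(h_bar, t, f):
    """Return +1 or -1 by thresholding a corrupted class-probability at f(t)."""
    return 1 if h_bar > f(t) else -1


r = 0.2  # illustrative flip probability


def f(h):
    """Label-noise instance of the monotone map from clean to corrupted probabilities."""
    return (1 - 2 * r) * h + r


t = 0.5  # clean threshold for 0-1 error
clean_probs = [0.05, 0.3, 0.5, 0.51, 0.9]
decisions = [classify_from_corrupted(f(h), t, f) for h in clean_probs]
print(decisions)
```

Because `f` is strictly increasing, the decisions match those obtained by thresholding the clean probabilities at `t` directly.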
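30 Story so far Classification scheme requires: $\bar h$; $t$; $a, b, p$. nature → $D$ → corruptor → $\bar D$ → class-prob estimator → $\hat{\bar h}$ → classifier $\mathrm{sign}(\hat{\bar h}(x) - f_{\hat a, \hat b, \hat p}(t))$, with a noise oracle supplying $\hat a, \hat b, \hat p$ 30 / 57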
33 Story so far Classification scheme requires: $\bar h$ → class-probability estimation (kernel logistic regression); $t$ → if unknown, alternate approach (see poster); $a, b, p$ → can we estimate these? nature → $D$ → corruptor → $\bar D$ → class-prob estimator → $\hat{\bar h}$ → classifier $\mathrm{sign}(\hat{\bar h}(x) - f_{\hat a, \hat b, \hat p}(t))$, with a noise estimator (?) supplying $\hat a, \hat b, \hat p$ 33 / 57
34 Estimating noise rates: some bad news $p$ strongly non-identifiable! $\bar p$ is allowed to be arbitrary (e.g. PU learning). $a, b$ non-identifiable without assumptions (Scott et al., 2013). Can we estimate $a, b$ under assumptions? 34 / 57
36 Weak separability assumption Assume that $D$ is weakly separable: $\min_{x \in X} h(x) = 0$ and $\max_{x \in X} h(x) = 1$, i.e. there exist deterministically +ve and −ve instances; weaker than full separability. Assumed range of $h$ constrains observed range of $\bar h$! 36 / 57
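37 Estimating noise rates Proposition: pick any weakly separable $D$. Then, for any $\bar D$, $a = \frac{\bar h_{\min} \cdot (\bar h_{\max} - \bar p)}{\bar p \cdot (\bar h_{\max} - \bar h_{\min})}$ and $b = \frac{(1 - \bar h_{\max}) \cdot (\bar p - \bar h_{\min})}{(1 - \bar p) \cdot (\bar h_{\max} - \bar h_{\min})}$, where $\bar h_{\min} = \min_{x \in X} \bar h(x)$ and $\bar h_{\max} = \max_{x \in X} \bar h(x)$. $a, b$ can be estimated from corrupted data alone 37 / 57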
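The range-based estimator can be sketched as follows, assuming a list of (estimated) corrupted class-probabilities and the corrupted base rate are available. The function name `estimate_noise_rates` is hypothetical; in practice the minimum and maximum would come from a fitted class-probability estimator and be subject to estimation error, which this sketch ignores.

```python
def estimate_noise_rates(h_bar_values, p_bar):
    """Estimate the noise rates (a, b) from the range of corrupted class-probabilities."""
    h_min, h_max = min(h_bar_values), max(h_bar_values)
    spread = h_max - h_min
    a_hat = h_min * (h_max - p_bar) / (p_bar * spread)
    b_hat = (1 - h_max) * (p_bar - h_min) / ((1 - p_bar) * spread)
    return a_hat, b_hat


# Illustrative input: with p_bar = 0.5, a = 0.2 and b = 0.1, weak separability
# gives a corrupted range of [2/11, 8/9].
a_hat, b_hat = estimate_noise_rates([2 / 11, 0.5, 8 / 9], p_bar=0.5)
print(a_hat, b_hat)
```

The example range was obtained from the corrupted odds formula at $h = 0$ and $h = 1$, so the estimator recovers the noise rates used to generate it.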
38 Estimating noise rates: special cases Label noise: $r = 1 - \bar h_{\max} = \bar h_{\min}$ and $p = \frac{\bar p - \bar h_{\min}}{\bar h_{\max} - \bar h_{\min}}$. PU learning: $a = 0$ and $b = p = \frac{1 - \bar h_{\max}}{\bar h_{\max}} \cdot \frac{\bar p}{1 - \bar p}$. (Elkan and Noto, 2008), (Liu and Tao, 2014); c.f. mixture proportion estimate of (Scott et al., 2013). In these cases, $p$ can be estimated as well 38 / 57
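40 Story so far Optimal classification in general requires $a, b, p$; when does $f_{a,b,p}(t)$ not depend on $a, b, p$? nature → $D$ → corruptor → $\bar D$ → class-prob estimator (kernel logistic regression) → $\hat{\bar h}$ → classifier $\mathrm{sign}(\hat{\bar h}(x) - f_{\hat a, \hat b, \hat p}(t))$, with a noise estimator that uses the range of $\hat{\bar h}$ to supply $\hat a, \hat b, \hat p$ 40 / 57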
41 Classification without noise rates 41 / 57
42 Balanced error (BER) of classifier Balanced error (BER) of a classifier $f \colon X \to \{\pm 1\}$ is $\mathrm{BER}^D(f) = \frac{\mathrm{FPR}^D(f) + \mathrm{FNR}^D(f)}{2}$ for false positive and negative rates $\mathrm{FPR}^D(f), \mathrm{FNR}^D(f)$: the average classification performance on each class. The optimal classifier is $\mathrm{sign}(h(x) - p)$ 42 / 57
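For reference, an empirical version of the balanced error can be computed from label and prediction lists. This is a minimal sketch with a hypothetical helper name, assuming labels in {+1, −1} and at least one example of each class.

```python
def balanced_error(y_true, y_pred):
    """Average of the false positive rate and the false negative rate."""
    pos_preds = [yp for yt, yp in zip(y_true, y_pred) if yt == 1]
    neg_preds = [yp for yt, yp in zip(y_true, y_pred) if yt == -1]
    fnr = sum(1 for yp in pos_preds if yp == -1) / len(pos_preds)
    fpr = sum(1 for yp in neg_preds if yp == 1) / len(neg_preds)
    return (fpr + fnr) / 2


# Three positives (one missed) and one negative (misclassified).
print(balanced_error([1, 1, 1, -1], [1, 1, -1, 1]))
```

Unlike 0-1 error, this quantity weights the two classes equally regardless of how many examples each contains.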
45 BER immunity under corruption Proposition (c.f. (Zhang and Lee, 2008)): for any $D, \bar D$, and classifier $f \colon X \to \{\pm 1\}$, $\mathrm{BER}^{\bar D}(f) = (1 - a - b) \cdot \mathrm{BER}^D(f) + \frac{a + b}{2}$. BER-optimal classifiers on clean and corrupted coincide: $\mathrm{sign}(h(x) - p) = \mathrm{sign}(\bar h(x) - \bar p)$. Minimise clean BER → don't need to know corruption rates! The threshold on $\bar h$ does not need $a, b, p$ 45 / 57
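The affine relation between clean and corrupted BER can be checked numerically. The sketch below derives the corrupted error rates directly from the mutually contaminated model, $\bar P = (1 - a) \cdot P + a \cdot Q$ and $\bar Q = b \cdot P + (1 - b) \cdot Q$; the helper names are hypothetical and the input rates are arbitrary illustrative values.

```python
def ber(fpr, fnr):
    """Balanced error from false positive and false negative rates."""
    return (fpr + fnr) / 2


def corrupted_rates(fpr, fnr, a, b):
    """Error rates of the same classifier measured on the corrupted distribution."""
    fpr_bar = b * (1 - fnr) + (1 - b) * fpr   # corrupted negatives: b*P + (1-b)*Q
    fnr_bar = (1 - a) * fnr + a * (1 - fpr)   # corrupted positives: (1-a)*P + a*Q
    return fpr_bar, fnr_bar


fpr, fnr, a, b = 0.15, 0.30, 0.2, 0.1
fpr_bar, fnr_bar = corrupted_rates(fpr, fnr, a, b)
lhs = ber(fpr_bar, fnr_bar)
rhs = (1 - a - b) * ber(fpr, fnr) + (a + b) / 2
print(lhs, rhs)
```

Since the map is affine with positive slope whenever $a + b < 1$, ranking classifiers by corrupted BER ranks them identically by clean BER.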
46 BER immunity & class-probability estimation Trivially, we also have $\mathrm{regret}^{D}_{\mathrm{BER}}(f) = (1 - a - b)^{-1} \cdot \mathrm{regret}^{\bar D}_{\mathrm{BER}}(f)$, i.e. good corrupted BER $\implies$ good clean BER. Can make $\mathrm{regret}^{\bar D}_{\mathrm{BER}}(f) \to 0$ by class-probability estimation. Similar result for AUC (see poster) 46 / 57
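48 BER immunity under corruption: proof From (Scott et al., 2013), $\begin{bmatrix} \mathrm{FPR}^{\bar D}(f) & \mathrm{FNR}^{\bar D}(f) \end{bmatrix} = \begin{bmatrix} \mathrm{FPR}^{D}(f) & \mathrm{FNR}^{D}(f) \end{bmatrix} \begin{bmatrix} 1 - b & -a \\ -b & 1 - a \end{bmatrix} + \begin{bmatrix} b & a \end{bmatrix}$, and $\begin{bmatrix} 1 & 1 \end{bmatrix}^T$ is an eigenvector of $\begin{bmatrix} 1 - b & -a \\ -b & 1 - a \end{bmatrix}$ 48 / 57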
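The step from the eigenvector observation to the BER identity can be written out as a short derivation. This is a sketch under a row-vector convention, with $A$ introduced here as shorthand for the $2 \times 2$ contamination matrix; it is not copied from the slides.

```latex
\begin{aligned}
A &= \begin{bmatrix} 1 - b & -a \\ -b & 1 - a \end{bmatrix},
\qquad
A \begin{bmatrix} 1 \\ 1 \end{bmatrix}
  = (1 - a - b) \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \\
\mathrm{BER}^{\bar D}(f)
  &= \tfrac{1}{2} \begin{bmatrix} \mathrm{FPR}^{\bar D}(f) & \mathrm{FNR}^{\bar D}(f) \end{bmatrix}
     \begin{bmatrix} 1 \\ 1 \end{bmatrix} \\
  &= \tfrac{1}{2} \begin{bmatrix} \mathrm{FPR}^{D}(f) & \mathrm{FNR}^{D}(f) \end{bmatrix}
     A \begin{bmatrix} 1 \\ 1 \end{bmatrix}
     + \tfrac{1}{2} \begin{bmatrix} b & a \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \\
  &= (1 - a - b) \cdot \mathrm{BER}^{D}(f) + \frac{a + b}{2}.
\end{aligned}
```

The eigenvalue $1 - a - b$ is exactly the slope of the affine map between clean and corrupted BER.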
49 Are other measures immune? BER is the only (non-trivial) performance measure for which: the corrupted risk is an affine transform of the clean risk (because of the eigenvector interpretation); the corrupted threshold is independent of $a, b, p$ (because of the nature of $f_{a,b,p}$, see poster). Other performance measures → need (one of) $a, b, p$ 49 / 57
50 Experiments 50 / 57
51 Experimental setup Injected label noise on UCI datasets. Estimate corrupted class-probabilities via a neural network; well-specified if $D$ is linearly separable: $h(x) = \sigma(\langle w, x \rangle) \implies \bar h(x) = a \cdot \sigma(\langle w, x \rangle) + b$. Evaluate: reliability of noise estimates; BER performance on clean test set (corrupted data used for training and validation); 0-1 performance on clean test set (see poster) 51 / 57
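Label-noise injection of this kind could be sketched as below. The slides do not give the authors' procedure, so this is only an assumed setup: `r_pos` and `r_neg` are taken to be the probabilities of flipping a positive and a negative label respectively, and the function name and seed handling are hypothetical.

```python
import random


def inject_label_noise(labels, r_pos, r_neg, seed=0):
    """Flip each +1 label w.p. r_pos and each -1 label w.p. r_neg (assumed convention)."""
    rng = random.Random(seed)  # local generator so results are reproducible
    noisy = []
    for y in labels:
        flip_prob = r_pos if y == 1 else r_neg
        noisy.append(-y if rng.random() < flip_prob else y)
    return noisy


clean = [1, 1, -1, -1, 1, -1, 1, -1]
print(inject_label_noise(clean, r_pos=0.1, r_neg=0.2))
```

Only the training and validation labels would be corrupted this way; the test labels stay clean so that clean BER can be measured.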
52 Experimental results: noise rates Estimated noise rates are generally reliable. [Figure: three panels (segment, spambase, mnist), each plotting the bias of the estimate (mean and median) against the ground truth noise] 52 / 57
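53 Experimental results: BER immunity Generally, low observed degradation in BER.
Dataset  | Noise                       | 1 − AUC (%) | BER (%)
segment  | None                        | 0.00 ± …    | … ± 0.00
segment  | $(r_+, r_-) = (0.1, 0.0)$   | 0.00 ± …    | … ± 0.00
segment  | $(r_+, r_-) = (0.1, 0.2)$   | 0.02 ± …    | … ± 0.08
segment  | $(r_+, r_-) = (0.2, 0.4)$   | 0.03 ± …    | … ± 0.20
spambase | None                        | 2.49 ± …    | … ± 0.00
spambase | $(r_+, r_-) = (0.1, 0.0)$   | 2.67 ± …    | … ± 0.03
spambase | $(r_+, r_-) = (0.1, 0.2)$   | 3.01 ± …    | … ± 0.05
spambase | $(r_+, r_-) = (0.2, 0.4)$   | 4.91 ± …    | … ± 0.13
mnist    | None                        | 0.92 ± …    | … ± 0.00
mnist    | $(r_+, r_-) = (0.1, 0.0)$   | 0.95 ± …    | … ± 0.01
mnist    | $(r_+, r_-) = (0.1, 0.2)$   | 0.97 ± …    | … ± 0.02
mnist    | $(r_+, r_-) = (0.2, 0.4)$   | 1.17 ± …    | … ± …
53 / 57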
54 Conclusion 54 / 57
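55 Learning from corrupted binary labels Monotone relationship $\bar h(x) = f_{a,b,p}(h(x))$ facilitates: nature → $D$ → corruptor → $\bar D$ → class-prob estimator (kernel logistic regression) → $\hat{\bar h}$ → classifier $\mathrm{sign}(\hat{\bar h}(x) - f_{\hat a, \hat b, \hat p}(t))$, with a noise estimator that uses the range of $\hat{\bar h}$ to supply $\hat a, \hat b, \hat p$ (omit for BER) 55 / 57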
56 Future work Better noise estimators in special cases? c.f. (Elkan and Noto, 2008) when $D$ separable. Fusion with the loss transfer approach (Natarajan et al., 2013): assumes noise rates known; better for misspecified models? c.f. non-robustness of convex surrogate minimisation 56 / 57
57 Thanks! Drop by the poster for more (Paper ID 69) 57 / 57
More informationLearning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels
Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels Curtis G. Northcutt, Tailin Wu, Isaac L. Chuang Massachusetts Institute of Technology Cambridge, MA 0239 Abstract
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationClass-prior Estimation for Learning from Positive and Unlabeled Data
JMLR: Workshop and Conference Proceedings 45:22 236, 25 ACML 25 Class-prior Estimation for Learning from Positive and Unlabeled Data Marthinus Christoffel du Plessis Gang Niu Masashi Sugiyama The University
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationDEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE
Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures
More informationBinary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach
Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach Krzysztof Dembczyński and Wojciech Kot lowski Intelligent Decision Support Systems Laboratory (IDSS) Poznań
More informationKernel Logistic Regression and the Import Vector Machine
Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationData Privacy in Biomedicine. Lecture 11b: Performance Measures for System Evaluation
Data Privacy in Biomedicine Lecture 11b: Performance Measures for System Evaluation Bradley Malin, PhD (b.malin@vanderbilt.edu) Professor of Biomedical Informatics, Biostatistics, & Computer Science Vanderbilt
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationABC-LogitBoost for Multi-Class Classification
Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12
More informationThe exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.
CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please
More informationMachine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?
Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do
More information