Classification objectives COMS 4771


1. Recap: binary classification

Scoring functions

Consider binary classification problems with Y = {−1, +1}.

WLOG, we can write a classifier as
  x ↦ +1 if h(x) > 0, and x ↦ −1 if h(x) ≤ 0,
for some scoring function h: X → R.

Some examples:
- For linear classifiers: h(x) = x^T w.
- For the Bayes optimal classifier: h(x) = P(Y = +1 | X = x) − 1/2.

The expected zero-one loss is E[ℓ_0/1(Y h(X))], where ℓ_0/1(z) = 1{z ≤ 0}.

Zero-one loss and some surrogate losses

[Plot: ℓ_0/1 and the surrogate losses below, as functions of z.]

Some surrogate losses that upper-bound ℓ_0/1:
- hinge: ℓ_hinge(z) = [1 − z]_+
- squared: ℓ_sq(z) = (1 − z)^2
- logistic: ℓ_logistic(z) = log_2(1 + e^{−z})
- exponential: ℓ_exp(z) = e^{−z}

Note: when y ∈ {±1} and p ∈ R, ℓ_sq(yp) = (1 − yp)^2 = (y − p)^2.
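
To make the surrogates concrete, here is a minimal numpy sketch (ours, not from the lecture) of the four losses, with a spot check that each upper-bounds the zero-one loss on a grid of margins z.

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)      # l_0/1(z) = 1{z <= 0}
def hinge(z):       return np.maximum(0.0, 1.0 - z)    # [1 - z]_+
def squared(z):     return (1.0 - z) ** 2              # (1 - z)^2
def logistic(z):    return np.log2(1.0 + np.exp(-z))   # log_2(1 + e^{-z})
def exponential(z): return np.exp(-z)                  # e^{-z}

z = np.linspace(-1.0, 1.5, 251)
for name, loss in [("hinge", hinge), ("squared", squared),
                   ("logistic", logistic), ("exponential", exponential)]:
    assert np.all(loss(z) >= zero_one(z) - 1e-12), name  # surrogate upper-bounds l_0/1
print("all four surrogates upper-bound the zero-one loss on the grid")
```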

Beyond binary classification

- What if different types of mistakes have different costs?
- What if we want the (conditional) probability of a positive label?
- What if there are more than two classes?

2. Cost-sensitive classification

Cost-sensitive classification

Often we have different costs for different kinds of mistakes. For c ∈ [0, 1]:

            ŷ = −1    ŷ = +1
  y = −1      0         c
  y = +1    1 − c       0

(Why can we restrict attention to c ∈ [0, 1]?)

Cost-sensitive ℓ-loss (for a loss ℓ: R → R):
  ℓ^(c)(y, p) = ( 1{y = +1}·(1 − c) + 1{y = −1}·c ) · ℓ(yp).
Taking ℓ = ℓ_0/1 gives the cost-sensitive zero-one loss.

Fact: if ℓ is convex, then so is ℓ^(c)(y, ·) for each y ∈ {±1}.

Cost-sensitive ℓ-risk and empirical ℓ-risk of a scoring function h: X → R:
  R^(c)(h) := E[ℓ^(c)(Y, h(X))]                          (our actual objective)
  R̂^(c)(h) := (1/n) Σ_{i=1}^n ℓ^(c)(Y_i, h(X_i))          (what we can try to minimize)
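
A minimal numpy sketch (ours) of the cost-sensitive loss ℓ^(c) and the empirical risk R̂^(c), assuming labels in {−1, +1}; the base loss here is the zero-one loss, but any loss ℓ can be plugged in.

```python
import numpy as np

def zero_one(z):
    return (z <= 0).astype(float)

def cost_sensitive_loss(y, p, c, base_loss=zero_one):
    """l^(c)(y, p) = (1{y=+1}(1-c) + 1{y=-1}c) * l(y*p)."""
    weight = np.where(y == 1, 1.0 - c, c)
    return weight * base_loss(y * p)

def empirical_cost_sensitive_risk(h, X, y, c, base_loss=zero_one):
    """Rhat^(c)(h) = (1/n) sum_i l^(c)(y_i, h(x_i))."""
    return np.mean(cost_sensitive_loss(y, h(X), c, base_loss))

# toy usage with a linear scoring function h(x) = x^T w
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w + 0.3 * rng.normal(size=100) > 0, 1, -1)
print(empirical_cost_sensitive_risk(lambda X: X @ w, X, y, c=0.2))
```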

Minimizer of cost-sensitive risk

What is the optimal classifier for the cost-sensitive (zero-one loss) risk?

Let η(x) := P(Y = +1 | X = x) for x ∈ X.

Conditional on X = x, the conditional cost-sensitive risk of a prediction ŷ,
  E[ℓ^(c)(Y, ŷ) | X = x] = η(x)·(1 − c)·1{ŷ = −1} + (1 − η(x))·c·1{ŷ = +1},
is minimized by
  ŷ := +1 if η(x)·(1 − c) > (1 − η(x))·c, i.e., if η(x) > c,  and ŷ := −1 otherwise.

Therefore, the classifier based on the scoring function
  h(x) := η(x) − c,  x ∈ X,
has the smallest cost-sensitive risk R^(c).

But where does c come from?
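
A quick numeric sanity check (ours) that the rule "predict +1 iff η(x) > c" always picks the prediction with the smaller conditional cost-sensitive risk, over a grid of η(x) and c values.

```python
import numpy as np

for c in np.linspace(0.05, 0.95, 19):
    for eta in np.linspace(0.0, 1.0, 101):
        risk_pos = (1.0 - eta) * c       # E[l^(c)(Y, +1) | X = x]
        risk_neg = eta * (1.0 - c)       # E[l^(c)(Y, -1) | X = x]
        chosen = risk_pos if eta > c else risk_neg   # risk of the rule "+1 iff eta > c"
        assert chosen <= min(risk_pos, risk_neg) + 1e-12
print("thresholding eta(x) at c is conditionally optimal everywhere on the grid")
```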

Common performance criteria

- Precision: P(Y = +1 | h(X) > 0).
- Recall (a.k.a. True Positive Rate): P(h(X) > 0 | Y = +1). Same as 1 − False Negative Rate.
- False Positive Rate: P(h(X) > 0 | Y = −1). Same as 1 − True Negative Rate.
- ...

Example: balanced error rate

Suppose you care about the Balanced Error Rate (BER):
  BER := (1/2)·False Negative Rate + (1/2)·False Positive Rate.
Which cost-sensitive risk should you (try to) minimize?

  2·BER = P(h(X) ≤ 0 | Y = +1) + P(h(X) > 0 | Y = −1)        [FNR + FPR]
        = (1/π)·P(h(X) ≤ 0, Y = +1) + (1/(1 − π))·P(h(X) > 0, Y = −1),

where π := P(Y = +1). The cost-sensitive risk R^(c) weights false negatives by 1 − c and false positives by c, so matching this weighting (up to a common scale) gives

  c := (1/(1 − π)) / ( 1/(1 − π) + 1/π ) = π = P(Y = +1).

So use R^(c) with c := π (which you can estimate).
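
A small simulation (ours) illustrating the scaling behind this choice: with c = π, the cost-sensitive zero-one risk works out to π(1 − π)·2·BER, so minimizing R^(c) with c := π̂ targets the balanced error rate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# imbalanced synthetic data: P(Y = +1) = 0.2, class-conditional Gaussian scores h(X)
y = np.where(rng.random(n) < 0.2, 1, -1)
h = rng.normal(loc=np.where(y == 1, 0.7, -0.7), scale=1.0)

pi = np.mean(y == 1)                  # estimate of P(Y = +1); set c := pi
fnr = np.mean(h[y == 1] <= 0)         # False Negative Rate
fpr = np.mean(h[y == -1] > 0)         # False Positive Rate
ber = 0.5 * (fnr + fpr)

cost = np.where(y == 1, 1.0 - pi, pi) * (y * h <= 0)   # l^(c)(y, h(x)), zero-one base loss
risk_c = np.mean(cost)

print(f"BER                    = {ber:.4f}")
print(f"R^(c) / (2 pi (1-pi))  = {risk_c / (2 * pi * (1 - pi)):.4f}   (should match)")
```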

Even more general setting

Distribution over importance-weighted examples: every example comes with an example-specific weight,
  (X, Y, W) ~ P,
where W is the (non-negative) importance weight of the example.

The importance-weighted ℓ-risk of h is R(h) := E[W·ℓ(Y·h(X))].

(This comes up in boosting, online decision-making, etc.)

Estimate it using an importance-weighted sample (X_1, Y_1, W_1), ..., (X_n, Y_n, W_n):
  R̂(h) := (1/n) Σ_{i=1}^n W_i·ℓ(Y_i·h(X_i)).
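
A one-function numpy sketch (ours) of the importance-weighted empirical risk, shown with the hinge loss; setting all weights to 1 recovers the plain empirical risk.

```python
import numpy as np

def iw_empirical_risk(h, X, y, w, loss):
    """Importance-weighted empirical l-risk: (1/n) * sum_i w_i * loss(y_i * h(x_i))."""
    return np.mean(w * loss(y * h(X)))

# usage with the hinge loss and a linear scoring function
hinge = lambda z: np.maximum(0.0, 1.0 - z)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] > 0, 1, -1)
w = rng.uniform(0.5, 2.0, size=50)        # example-specific non-negative weights
h = lambda X: X @ np.array([1.0, 0.0])
print(iw_empirical_risk(h, X, y, w, hinge))
```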

Example: importance weights

Suppose you have two ways of collecting training data:

Type 1: randomly draw (X, Y) ~ P. (E.g., all e-mails.)

Type 2 (using a filter φ: X → {0, 1}): draw (X, Y) ~ P; return (X, Y) if φ(X) = 1, repeat otherwise. (E.g., e-mails that made it past the spam filter.)
Can we estimate E_{(X,Y)~P}[ℓ(Y·h(X))] using only examples of Type 2?

Type 3 (using a probabilistic filter q: X → (0, 1)): draw (X, Y) ~ P; toss a coin B with heads probability q(X); return (X, Y) if B is heads, repeat otherwise. (E.g., e-mails that made it past the randomized filter.)
Can we estimate E_{(X,Y)~P}[ℓ(Y·h(X))] using only examples of Type 3?
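
For Type 3, one standard answer (our sketch; the slide leaves the question open) is inverse-probability weighting: give each surviving example the weight W = 1/q(X) and use the self-normalized importance-weighted average, which is a consistent estimate of E_{(X,Y)~P}[ℓ(Y·h(X))].

```python
import numpy as np

rng = np.random.default_rng(3)
hinge = lambda z: np.maximum(0.0, 1.0 - z)
h = lambda x: 2.0 * x - 1.0                        # some fixed scoring function on X = R

def draw_P(n):
    """Draw (X, Y) ~ P: X uniform on [0, 1], P(Y = +1 | X = x) = x."""
    x = rng.random(n)
    y = np.where(rng.random(n) < x, 1, -1)
    return x, y

q = lambda x: 0.2 + 0.6 * x                        # probabilistic filter q: X -> (0, 1)

# Type 1: plain Monte Carlo estimate of E_P[l(Y h(X))]
x1, y1 = draw_P(500_000)
print("Type 1 estimate:", np.mean(hinge(y1 * h(x1))))

# Type 3: keep examples that pass the randomized filter, weight by 1/q(x), self-normalize
x, y = draw_P(500_000)
keep = rng.random(x.size) < q(x)
x3, y3, w3 = x[keep], y[keep], 1.0 / q(x[keep])
print("Type 3 estimate:", np.sum(w3 * hinge(y3 * h(x3))) / np.sum(w3))
```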

3. Conditional probability estimation

Why conditional probabilities?

Having the conditional probability function η permits probabilistic reasoning.

Example: Suppose I have a classifier f: X → {−1, +1}. False positives cost c, false negatives cost 1 − c. Given x, how do we estimate the expected cost of the prediction f(x)?

Expected cost:
  (1 − c)·P(Y = +1 | X = x)   if f(x) = −1,
  c·P(Y = −1 | X = x)         if f(x) = +1.

To estimate the expected cost, it suffices to estimate η(x) = P(Y = +1 | X = x).

Eliciting conditional probabilities

Can we estimate η by minimizing a particular (empirical) risk?

Squared loss: ℓ_sq(z) = (1 − z)^2.
  E[ℓ_sq(Y·h(X)) | X = x] = η(x)·(1 − h(x))^2 + (1 − η(x))·(1 + h(x))^2.
Using calculus, we can see that this is minimized by h such that
  h(x) = 2η(x) − 1,  i.e.,  η(x) = (1 + h(x))/2.

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) squared loss risk.
2. Construct the estimate η̂ via η̂(x) := (1 + ĥ(x))/2, x ∈ X.
   (May want to clip the output to the range [0, 1].)
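
A minimal sketch (ours) of this recipe with a linear scoring function: since ℓ_sq(y·x^T w) = (y − x^T w)^2 for y ∈ {±1}, step 1 is just least squares of y on x (an intercept column is added here for flexibility), and step 2 is the clipped transform (1 + ĥ(x))/2.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic data with a known eta(x)
n, d = 5000, 2
X = rng.normal(size=(n, d))
eta = 1.0 / (1.0 + np.exp(-(X @ np.array([1.5, -1.0]))))   # true P(Y = +1 | X = x)
y = np.where(rng.random(n) < eta, 1.0, -1.0)

# step 1: minimize the empirical squared-loss risk over (affine) scoring functions
Xb = np.hstack([X, np.ones((n, 1))])                        # intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
h_hat = Xb @ w

# step 2: plug-in estimate of eta, clipped to [0, 1]
eta_hat = np.clip((1.0 + h_hat) / 2.0, 0.0, 1.0)
print("mean absolute error of eta_hat:", np.mean(np.abs(eta_hat - eta)))
```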

Eliciting conditional probabilities (again)

Logistic loss: ℓ_logistic(z) = log_2(1 + e^{−z}).
  E[ℓ_logistic(Y·h(X)) | X = x] = η(x)·log_2(1 + e^{−h(x)}) + (1 − η(x))·log_2(1 + e^{h(x)}).
This is minimized by h such that logistic(h(x)) = η(x), where logistic(z) = 1/(1 + e^{−z}).

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) logistic loss risk.
2. Construct the estimate η̂ via η̂(x) := logistic(ĥ(x)), x ∈ X.

This works for many other losses...
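
The same recipe sketched in code (ours): fit a linear scoring function by plain gradient descent on the empirical logistic-loss risk (natural log is used; the log_2 version differs only by a constant factor, so the minimizer is the same), then apply the logistic function to ĥ.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # the logistic function

rng = np.random.default_rng(5)
n, d = 5000, 2
X = rng.normal(size=(n, d))
eta = sigmoid(X @ np.array([1.5, -1.0]) + 0.3)              # true P(Y = +1 | X = x)
y = np.where(rng.random(n) < eta, 1.0, -1.0)
Xb = np.hstack([X, np.ones((n, 1))])                        # intercept column

# step 1: gradient descent on (1/n) sum_i log(1 + exp(-y_i x_i^T w))
w = np.zeros(d + 1)
for _ in range(2000):
    margins = y * (Xb @ w)
    grad = -(Xb * (y * sigmoid(-margins))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# step 2: plug-in estimate eta_hat = logistic(h_hat)
eta_hat = sigmoid(Xb @ w)
print("mean absolute error of eta_hat:", np.mean(np.abs(eta_hat - eta)))
```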

Deficiency of hinge loss

But not the hinge loss: ℓ_hinge(z) = max{0, 1 − z}.

E[ℓ_hinge(Y·h(X)) | X = x] is minimized by h such that h(x) = sign(2η(x) − 1).

[Plot: the conditional hinge loss risk as a function of h(x), for η(x) = 1/4, 1/3, 2/3, 3/4.]

Cannot recover η(x) from the minimizing h(x).
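
A quick grid search (ours) confirming the picture: for each η(x), the conditional hinge risk η(x)·[1 − h]_+ + (1 − η(x))·[1 + h]_+ is minimized at h = sign(2η(x) − 1), which reveals only whether η(x) is above or below 1/2.

```python
import numpy as np

h_grid = np.linspace(-5.0, 5.0, 2001)
for eta in [0.25, 1/3, 2/3, 0.75]:
    risk = eta * np.maximum(0.0, 1.0 - h_grid) + (1.0 - eta) * np.maximum(0.0, 1.0 + h_grid)
    h_star = h_grid[np.argmin(risk)]                     # conditional hinge risk minimizer
    print(f"eta(x) = {eta:.2f}  ->  minimizing h(x) = {h_star:+.2f}")
# prints -1 for eta < 1/2 and +1 for eta > 1/2: the minimizer forgets the value of eta
```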

Difficulty of estimating conditional probabilities

Accurately estimating conditional probabilities is harder than learning a good classifier.

Example: In general,
  min_{w ∈ R^d} E[ℓ_sq(Y·X^T w)]  >  min_{h: R^d → R} E[ℓ_sq(Y·h(X))].
The optimal scoring function x ↦ 2η(x) − 1 is not necessarily well-approximated by a linear function. (We'll see an example of this later.)

Common remedies: use more flexible models (e.g., polynomial functions).

Probabilistic calibration

Say η̂ is (probabilistically) calibrated if
  P(Y = +1 | η̂(X) = p) = p  for all p ∈ [0, 1].

E.g., it rains on half of the days on which you say "50% chance of rain".

Note: this is achieved by a constant function, η̂(x) = P(Y = +1), x ∈ X. So calibration is weaker than approximating the conditional probability function.

Useful for interpretability.

To (approximately) achieve calibration, we can post-process the output of a scoring function h.

Achieving calibration via post-processing

Use (validation) data ((x_i, y_i))_{i=1}^n from X × {±1} to learn g: R → [0, 1] so that η̂ := g ∘ h is (approximately) calibrated.

Option #1: Use a flexible model like Nadaraya-Watson regression to minimize a suitable loss (e.g., squared loss). E.g.,
  g(z) := Σ_{i=1}^n 1{y_i = +1}·K((z − h(x_i))/σ)  /  Σ_{i=1}^n K((z − h(x_i))/σ),
where K is a smoothing kernel, like K(δ) = exp(−δ^2/2), and σ > 0 is the bandwidth.

Option #2: Use isotonic regression, which finds a monotone increasing g.

[Plot: the labels 1{y_i = +1} and the fitted values g(h(x_i)) plotted against h(x).]

And many others...
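
A small numpy sketch (ours) of Option #1: the Nadaraya-Watson map g(z) from the slide with a Gaussian kernel, applied to validation scores h(x_i).

```python
import numpy as np

def nadaraya_watson_calibrator(scores, labels, sigma=0.5):
    """g(z) = sum_i 1{y_i=+1} K((z - h(x_i))/sigma) / sum_i K((z - h(x_i))/sigma)."""
    pos = (labels == 1).astype(float)
    def g(z):
        z = np.atleast_1d(np.asarray(z, dtype=float))
        diffs = (z[:, None] - scores[None, :]) / sigma
        K = np.exp(-0.5 * diffs ** 2)                 # Gaussian smoothing kernel
        return (K @ pos) / K.sum(axis=1)
    return g

# usage on synthetic validation data where the true calibration map is the sigmoid
rng = np.random.default_rng(6)
scores = 2.0 * rng.normal(size=2000)                  # h(x_i) on validation points
p_true = 1.0 / (1.0 + np.exp(-scores))
labels = np.where(rng.random(2000) < p_true, 1, -1)
g = nadaraya_watson_calibrator(scores, labels, sigma=0.3)
print(g(np.array([-2.0, 0.0, 2.0])))                  # roughly [0.12, 0.5, 0.88]
```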

4. Multiclass classification

Multiclass classification

Often we have more than two classes: Y = {1, ..., K} (multiclass classification).

Logistic regression generalizes to multinomial logistic regression, where
  P(Y = k | X = x) = exp(x^T β_k) / Σ_{k'=1}^K exp(x^T β_{k'}),  x ∈ R^d.
MLE here is similar to MLE in the binary case.

Perceptron, SVM: many extensions for multiclass.

Neural networks: often formulated like multinomial logistic regression.

Can we generically reduce multiclass classification to binary classification?
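
A short sketch (ours) of the multinomial logistic (softmax) probabilities above, computed in a numerically stable way.

```python
import numpy as np

def softmax_probs(x, B):
    """P(Y = k | X = x) = exp(x^T beta_k) / sum_k' exp(x^T beta_k').

    x: (d,) feature vector; B: (d, K) matrix whose k-th column is beta_k.
    """
    scores = x @ B                       # x^T beta_k for each class k
    scores = scores - scores.max()       # subtracting the max does not change the ratios
    e = np.exp(scores)
    return e / e.sum()

# toy usage with d = 3 features and K = 4 classes
rng = np.random.default_rng(7)
B = rng.normal(size=(3, 4))
x = rng.normal(size=3)
p = softmax_probs(x, B)
print(p, p.sum())                        # probabilities over the K classes, summing to 1
```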

One-against-all (of the rest) reduction

One-against-all training
input: Labeled examples ((x_i, y_i))_{i=1}^n from X × {1, ..., K}; a learning algorithm A that takes data from X × {±1} and returns a scoring function h: X → R.
output: Scoring functions ĥ_1, ..., ĥ_K: X → R.
1: for k = 1, ..., K do
2:   Create the data set S_k = ((x_i, y_i^(k)))_{i=1}^n where y_i^(k) = +1 if y_i = k, and −1 otherwise.
3:   Run A on S_k to obtain a scoring function ĥ_k: X → R.
4: end for
5: return scoring functions ĥ_1, ..., ĥ_K.

One-against-all prediction
input: Scoring functions ĥ_1, ..., ĥ_K: X → R; a new point x ∈ X.
output: Predicted label ŷ ∈ {1, ..., K}.
1: return ŷ ∈ arg max_{k ∈ {1,...,K}} ĥ_k(x).
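
A compact Python rendering (ours) of this pseudocode, parameterized by any binary learner A that maps (X, y ∈ {±1}) to a scoring function; the least-squares learner below is purely for illustration.

```python
import numpy as np

def one_against_all_train(X, y, K, A):
    """Run the binary learner A on each "class k vs. rest" problem; return [h_1, ..., h_K]."""
    scorers = []
    for k in range(1, K + 1):
        y_k = np.where(y == k, 1.0, -1.0)        # relabel: +1 if y_i = k, -1 otherwise
        scorers.append(A(X, y_k))
    return scorers

def one_against_all_predict(scorers, X):
    """Predict argmax_k h_k(x), with labels 1..K."""
    scores = np.column_stack([h(X) for h in scorers])
    return 1 + np.argmax(scores, axis=1)

# an example binary learner A: affine least-squares scoring function
def least_squares_learner(X, y_pm):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda Xnew: np.hstack([Xnew, np.ones((Xnew.shape[0], 1))]) @ w

# toy usage: three well-separated Gaussian classes in R^2
rng = np.random.default_rng(8)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in means])
y = np.repeat([1, 2, 3], 200)
scorers = one_against_all_train(X, y, K=3, A=least_squares_learner)
print("training accuracy:", np.mean(one_against_all_predict(scorers, X) == y))
```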

When does one-against-all work well?

Main idea: Suppose every ĥ_k corresponds to a good estimate η̂_k of the conditional probability function, i.e., η̂_k(x) ≈ P(Y = k | X = x) on average. (E.g., η̂_k(x) = (1 + ĥ_k(x))/2 for all k = 1, ..., K.)

Then the behavior of the one-against-all classifier f̂ is similar to that of the Bayes optimal classifier!

Masking

Suppose X = R, Y = {1, 2, 3}, and P_{Y|X=x} is piecewise constant.

[Plot: the three conditional probability functions P(Y = k | X = x), each an indicator of an interval.]

E.g., P(Y = 1 | X = x) = 1{x ∈ [0.1, 0.9]}, P(Y = 2 | X = x) = 1{x ∈ [1.1, 1.9]}, P(Y = 3 | X = x) = 1{x ∈ [2.1, 2.9]}.

Try to fit P(Y = k | X = x) using affine functions of x, e.g., η̂_k(x) = m_k·x + b_k for some m_k, b_k ∈ R.

Masking (continued)

[Plot: the three conditional probability functions P(Y = k | X = x) and the best affine approximation to each.]

The arg max_k η̂_k(x) is never 2, so the classifier never predicts label 2.

How to fix this?
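
A short simulation (ours) reproducing the masking phenomenon: fit the best affine approximation to each indicator 1{y = k} by least squares, then check how often arg max_k η̂_k(x) equals 2.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60_000

# X uniform over [0.1, 0.9] u [1.1, 1.9] u [2.1, 2.9]; Y = 1, 2, 3 on the three intervals
k = rng.integers(1, 4, size=n)                       # class label 1, 2, or 3
x = rng.uniform(0.1, 0.9, size=n) + (k - 1)          # shift into the k-th interval
Xb = np.column_stack([x, np.ones(n)])                # affine features [x, 1]

# best affine fit eta_hat_k(x) = m_k x + b_k to the indicator 1{y = k}, by least squares
coef = np.column_stack([np.linalg.lstsq(Xb, (k == c).astype(float), rcond=None)[0]
                        for c in (1, 2, 3)])
eta_hat = Xb @ coef                                  # n x 3 matrix of fitted values
pred = 1 + np.argmax(eta_hat, axis=1)

print("fraction predicted as each class:", [round(np.mean(pred == c), 4) for c in (1, 2, 3)])
# class 2 is (essentially) never the argmax: it is "masked" by classes 1 and 3
```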

Key takeaways

1. Scoring functions and thresholds; alternative performance criteria.
2. Eliciting conditional probabilities with loss functions.
3. Calibration property.
4. One-against-all approach to multiclass classification; masking.