Classification objectives COMS 4771


1. Recap: binary classification

Scoring functions

Consider binary classification problems with Y = {−1, +1}.

WLOG, we can write a classifier as
  x ↦ +1 if h(x) > 0, and x ↦ −1 if h(x) ≤ 0,
for some scoring function h: X → R.

Some examples:
- For linear classifiers: h(x) = x^T w.
- For the Bayes optimal classifier: h(x) = P(Y = +1 | X = x) − 1/2.

The expected zero-one loss is E[ℓ_0/1(Y h(X))], where ℓ_0/1(z) = 1{z ≤ 0}.

Zero-one loss and some surrogate losses

[Plot: ℓ_0/1 and the surrogate losses below, as functions of z.]

Some surrogate losses that upper-bound ℓ_0/1:
- hinge: ℓ_hinge(z) = [1 − z]_+
- squared: ℓ_sq(z) = (1 − z)^2
- logistic: ℓ_logistic(z) = log_2(1 + e^{−z})
- exponential: ℓ_exp(z) = e^{−z}

Note: when y ∈ {±1} and p ∈ R, ℓ_sq(yp) = (1 − yp)^2 = (y − p)^2.
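
To make the surrogates concrete, here is a minimal numpy sketch (ours, not from the lecture) of the four losses, with a spot check that each upper-bounds the zero-one loss on a grid of margins z.

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)      # l_0/1(z) = 1{z <= 0}
def hinge(z):       return np.maximum(0.0, 1.0 - z)    # [1 - z]_+
def squared(z):     return (1.0 - z) ** 2              # (1 - z)^2
def logistic(z):    return np.log2(1.0 + np.exp(-z))   # log_2(1 + e^{-z})
def exponential(z): return np.exp(-z)                  # e^{-z}

z = np.linspace(-1.0, 1.5, 251)
for name, loss in [("hinge", hinge), ("squared", squared),
                   ("logistic", logistic), ("exponential", exponential)]:
    assert np.all(loss(z) >= zero_one(z) - 1e-12), name  # surrogate upper-bounds l_0/1
print("all four surrogates upper-bound the zero-one loss on the grid")
```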

Beyond binary classification

- What if different types of mistakes have different costs?
- What if we want the (conditional) probability of a positive label?
- What if there are more than two classes?

2. Cost-sensitive classification

Cost-sensitive classification

Often we have different costs for different kinds of mistakes. For c ∈ [0, 1]:

            ŷ = −1    ŷ = +1
  y = −1      0         c
  y = +1    1 − c       0

(Why can we restrict attention to c ∈ [0, 1]?)

Cost-sensitive ℓ-loss (for a loss ℓ: R → R):
  ℓ^(c)(y, p) = ( 1{y = +1}·(1 − c) + 1{y = −1}·c ) · ℓ(yp).
Taking ℓ = ℓ_0/1 gives the cost-sensitive zero-one loss.

Fact: if ℓ is convex, then so is ℓ^(c)(y, ·) for each y ∈ {±1}.

Cost-sensitive ℓ-risk and empirical ℓ-risk of a scoring function h: X → R:
  R^(c)(h) := E[ℓ^(c)(Y, h(X))]                          (our actual objective)
  R̂^(c)(h) := (1/n) Σ_{i=1}^n ℓ^(c)(Y_i, h(X_i))          (what we can try to minimize)
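
A minimal numpy sketch (ours) of the cost-sensitive loss ℓ^(c) and the empirical risk R̂^(c), assuming labels in {−1, +1}; the base loss here is the zero-one loss, but any loss ℓ can be plugged in.

```python
import numpy as np

def zero_one(z):
    return (z <= 0).astype(float)

def cost_sensitive_loss(y, p, c, base_loss=zero_one):
    """l^(c)(y, p) = (1{y=+1}(1-c) + 1{y=-1}c) * l(y*p)."""
    weight = np.where(y == 1, 1.0 - c, c)
    return weight * base_loss(y * p)

def empirical_cost_sensitive_risk(h, X, y, c, base_loss=zero_one):
    """Rhat^(c)(h) = (1/n) sum_i l^(c)(y_i, h(x_i))."""
    return np.mean(cost_sensitive_loss(y, h(X), c, base_loss))

# toy usage with a linear scoring function h(x) = x^T w
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w + 0.3 * rng.normal(size=100) > 0, 1, -1)
print(empirical_cost_sensitive_risk(lambda X: X @ w, X, y, c=0.2))
```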

Minimizer of cost-sensitive risk

What is the optimal classifier for the cost-sensitive (zero-one loss) risk?

Let η(x) := P(Y = +1 | X = x) for x ∈ X.

Conditional on X = x, the conditional cost-sensitive risk of a prediction ŷ,
  E[ℓ^(c)(Y, ŷ) | X = x] = η(x)·(1 − c)·1{ŷ = −1} + (1 − η(x))·c·1{ŷ = +1},
is minimized by
  ŷ := +1 if η(x)·(1 − c) > (1 − η(x))·c, i.e., if η(x) > c,  and ŷ := −1 otherwise.

Therefore, the classifier based on the scoring function
  h(x) := η(x) − c,  x ∈ X,
has the smallest cost-sensitive risk R^(c).

But where does c come from?
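
A quick numeric sanity check (ours) that the rule "predict +1 iff η(x) > c" always picks the prediction with the smaller conditional cost-sensitive risk, over a grid of η(x) and c values.

```python
import numpy as np

for c in np.linspace(0.05, 0.95, 19):
    for eta in np.linspace(0.0, 1.0, 101):
        risk_pos = (1.0 - eta) * c       # E[l^(c)(Y, +1) | X = x]
        risk_neg = eta * (1.0 - c)       # E[l^(c)(Y, -1) | X = x]
        chosen = risk_pos if eta > c else risk_neg   # risk of the rule "+1 iff eta > c"
        assert chosen <= min(risk_pos, risk_neg) + 1e-12
print("thresholding eta(x) at c is conditionally optimal everywhere on the grid")
```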

Common performance criteria

- Precision: P(Y = +1 | h(X) > 0).
- Recall (a.k.a. True Positive Rate): P(h(X) > 0 | Y = +1). Same as 1 − False Negative Rate.
- False Positive Rate: P(h(X) > 0 | Y = −1). Same as 1 − True Negative Rate.
- ...

Example: balanced error rate

Suppose you care about the Balanced Error Rate (BER):
  BER := (1/2)·False Negative Rate + (1/2)·False Positive Rate.
Which cost-sensitive risk should you (try to) minimize?

  2·BER = P(h(X) ≤ 0 | Y = +1) + P(h(X) > 0 | Y = −1)        [FNR + FPR]
        = (1/π)·P(h(X) ≤ 0, Y = +1) + (1/(1 − π))·P(h(X) > 0, Y = −1),

where π := P(Y = +1). The cost-sensitive risk R^(c) weights false negatives by 1 − c and false positives by c, so matching this weighting (up to a common scale) gives

  c := (1/(1 − π)) / ( 1/(1 − π) + 1/π ) = π = P(Y = +1).

So use R^(c) with c := π (which you can estimate).
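
A small simulation (ours) illustrating the scaling behind this choice: with c = π, the cost-sensitive zero-one risk works out to π(1 − π)·2·BER, so minimizing R^(c) with c := π̂ targets the balanced error rate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# imbalanced synthetic data: P(Y = +1) = 0.2, class-conditional Gaussian scores h(X)
y = np.where(rng.random(n) < 0.2, 1, -1)
h = rng.normal(loc=np.where(y == 1, 0.7, -0.7), scale=1.0)

pi = np.mean(y == 1)                  # estimate of P(Y = +1); set c := pi
fnr = np.mean(h[y == 1] <= 0)         # False Negative Rate
fpr = np.mean(h[y == -1] > 0)         # False Positive Rate
ber = 0.5 * (fnr + fpr)

cost = np.where(y == 1, 1.0 - pi, pi) * (y * h <= 0)   # l^(c)(y, h(x)), zero-one base loss
risk_c = np.mean(cost)

print(f"BER                    = {ber:.4f}")
print(f"R^(c) / (2 pi (1-pi))  = {risk_c / (2 * pi * (1 - pi)):.4f}   (should match)")
```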

Even more general setting

Distribution over importance-weighted examples: every example comes with an example-specific weight,
  (X, Y, W) ~ P,
where W is the (non-negative) importance weight of the example.

The importance-weighted ℓ-risk of h is R(h) := E[W·ℓ(Y·h(X))].

(This comes up in boosting, online decision-making, etc.)

Estimate it using an importance-weighted sample (X_1, Y_1, W_1), ..., (X_n, Y_n, W_n):
  R̂(h) := (1/n) Σ_{i=1}^n W_i·ℓ(Y_i·h(X_i)).
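
A one-function numpy sketch (ours) of the importance-weighted empirical risk, shown with the hinge loss; setting all weights to 1 recovers the plain empirical risk.

```python
import numpy as np

def iw_empirical_risk(h, X, y, w, loss):
    """Importance-weighted empirical l-risk: (1/n) * sum_i w_i * loss(y_i * h(x_i))."""
    return np.mean(w * loss(y * h(X)))

# usage with the hinge loss and a linear scoring function
hinge = lambda z: np.maximum(0.0, 1.0 - z)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] > 0, 1, -1)
w = rng.uniform(0.5, 2.0, size=50)        # example-specific non-negative weights
h = lambda X: X @ np.array([1.0, 0.0])
print(iw_empirical_risk(h, X, y, w, hinge))
```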

Example: importance weights

Suppose you have two ways of collecting training data:

Type 1: randomly draw (X, Y) ~ P. (E.g., all e-mails.)

Type 2 (using a filter φ: X → {0, 1}): draw (X, Y) ~ P; return (X, Y) if φ(X) = 1, repeat otherwise. (E.g., e-mails that made it past the spam filter.)
Can we estimate E_{(X,Y)~P}[ℓ(Y·h(X))] using only examples of Type 2?

Type 3 (using a probabilistic filter q: X → (0, 1)): draw (X, Y) ~ P; toss a coin B with heads probability q(X); return (X, Y) if B is heads, repeat otherwise. (E.g., e-mails that made it past the randomized filter.)
Can we estimate E_{(X,Y)~P}[ℓ(Y·h(X))] using only examples of Type 3?
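
For Type 3, one standard answer (our sketch; the slide leaves the question open) is inverse-probability weighting: give each surviving example the weight W = 1/q(X) and use the self-normalized importance-weighted average, which is a consistent estimate of E_{(X,Y)~P}[ℓ(Y·h(X))].

```python
import numpy as np

rng = np.random.default_rng(3)
hinge = lambda z: np.maximum(0.0, 1.0 - z)
h = lambda x: 2.0 * x - 1.0                        # some fixed scoring function on X = R

def draw_P(n):
    """Draw (X, Y) ~ P: X uniform on [0, 1], P(Y = +1 | X = x) = x."""
    x = rng.random(n)
    y = np.where(rng.random(n) < x, 1, -1)
    return x, y

q = lambda x: 0.2 + 0.6 * x                        # probabilistic filter q: X -> (0, 1)

# Type 1: plain Monte Carlo estimate of E_P[l(Y h(X))]
x1, y1 = draw_P(500_000)
print("Type 1 estimate:", np.mean(hinge(y1 * h(x1))))

# Type 3: keep examples that pass the randomized filter, weight by 1/q(x), self-normalize
x, y = draw_P(500_000)
keep = rng.random(x.size) < q(x)
x3, y3, w3 = x[keep], y[keep], 1.0 / q(x[keep])
print("Type 3 estimate:", np.sum(w3 * hinge(y3 * h(x3))) / np.sum(w3))
```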

3. Conditional probability estimation

Why conditional probabilities?

Having the conditional probability function η permits probabilistic reasoning.

Example: Suppose I have a classifier f: X → {−1, +1}. False positives cost c, false negatives cost 1 − c. Given x, how do we estimate the expected cost of the prediction f(x)?

Expected cost:
  (1 − c)·P(Y = +1 | X = x)   if f(x) = −1,
  c·P(Y = −1 | X = x)         if f(x) = +1.

To estimate the expected cost, it suffices to estimate η(x) = P(Y = +1 | X = x).

Eliciting conditional probabilities

Can we estimate η by minimizing a particular (empirical) risk?

Squared loss: ℓ_sq(z) = (1 − z)^2.
  E[ℓ_sq(Y·h(X)) | X = x] = η(x)·(1 − h(x))^2 + (1 − η(x))·(1 + h(x))^2.
Using calculus, we can see that this is minimized by h such that
  h(x) = 2η(x) − 1,  i.e.,  η(x) = (1 + h(x))/2.

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) squared loss risk.
2. Construct the estimate η̂ via η̂(x) := (1 + ĥ(x))/2, x ∈ X.
   (May want to clip the output to the range [0, 1].)
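
A minimal sketch (ours) of this recipe with a linear scoring function: since ℓ_sq(y·x^T w) = (y − x^T w)^2 for y ∈ {±1}, step 1 is just least squares of y on x (an intercept column is added here for flexibility), and step 2 is the clipped transform (1 + ĥ(x))/2.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic data with a known eta(x)
n, d = 5000, 2
X = rng.normal(size=(n, d))
eta = 1.0 / (1.0 + np.exp(-(X @ np.array([1.5, -1.0]))))   # true P(Y = +1 | X = x)
y = np.where(rng.random(n) < eta, 1.0, -1.0)

# step 1: minimize the empirical squared-loss risk over (affine) scoring functions
Xb = np.hstack([X, np.ones((n, 1))])                        # intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
h_hat = Xb @ w

# step 2: plug-in estimate of eta, clipped to [0, 1]
eta_hat = np.clip((1.0 + h_hat) / 2.0, 0.0, 1.0)
print("mean absolute error of eta_hat:", np.mean(np.abs(eta_hat - eta)))
```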

Eliciting conditional probabilities (again)

Logistic loss: ℓ_logistic(z) = log_2(1 + e^{−z}).
  E[ℓ_logistic(Y·h(X)) | X = x] = η(x)·log_2(1 + e^{−h(x)}) + (1 − η(x))·log_2(1 + e^{h(x)}).
This is minimized by h such that logistic(h(x)) = η(x), where logistic(z) = 1/(1 + e^{−z}).

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) logistic loss risk.
2. Construct the estimate η̂ via η̂(x) := logistic(ĥ(x)), x ∈ X.

This works for many other losses...
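
The same recipe sketched in code (ours): fit a linear scoring function by plain gradient descent on the empirical logistic-loss risk (natural log is used; the log_2 version differs only by a constant factor, so the minimizer is the same), then apply the logistic function to ĥ.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # the logistic function

rng = np.random.default_rng(5)
n, d = 5000, 2
X = rng.normal(size=(n, d))
eta = sigmoid(X @ np.array([1.5, -1.0]) + 0.3)              # true P(Y = +1 | X = x)
y = np.where(rng.random(n) < eta, 1.0, -1.0)
Xb = np.hstack([X, np.ones((n, 1))])                        # intercept column

# step 1: gradient descent on (1/n) sum_i log(1 + exp(-y_i x_i^T w))
w = np.zeros(d + 1)
for _ in range(2000):
    margins = y * (Xb @ w)
    grad = -(Xb * (y * sigmoid(-margins))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# step 2: plug-in estimate eta_hat = logistic(h_hat)
eta_hat = sigmoid(Xb @ w)
print("mean absolute error of eta_hat:", np.mean(np.abs(eta_hat - eta)))
```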

Deficiency of hinge loss

But not the hinge loss: ℓ_hinge(z) = max{0, 1 − z}.

E[ℓ_hinge(Y·h(X)) | X = x] is minimized by h such that h(x) = sign(2η(x) − 1).

[Plot: the conditional hinge loss risk as a function of h(x), for η(x) = 1/4, 1/3, 2/3, 3/4.]

Cannot recover η(x) from the minimizing h(x).
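
A quick grid search (ours) confirming the picture: for each η(x), the conditional hinge risk η(x)·[1 − h]_+ + (1 − η(x))·[1 + h]_+ is minimized at h = sign(2η(x) − 1), which reveals only whether η(x) is above or below 1/2.

```python
import numpy as np

h_grid = np.linspace(-5.0, 5.0, 2001)
for eta in [0.25, 1/3, 2/3, 0.75]:
    risk = eta * np.maximum(0.0, 1.0 - h_grid) + (1.0 - eta) * np.maximum(0.0, 1.0 + h_grid)
    h_star = h_grid[np.argmin(risk)]                     # conditional hinge risk minimizer
    print(f"eta(x) = {eta:.2f}  ->  minimizing h(x) = {h_star:+.2f}")
# prints -1 for eta < 1/2 and +1 for eta > 1/2: the minimizer forgets the value of eta
```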

Difficulty of estimating conditional probabilities

Accurately estimating conditional probabilities is harder than learning a good classifier.

Example: In general,
  min_{w ∈ R^d} E[ℓ_sq(Y·X^T w)]  >  min_{h: R^d → R} E[ℓ_sq(Y·h(X))].
The optimal scoring function x ↦ 2η(x) − 1 is not necessarily well-approximated by a linear function. (We'll see an example of this later.)

Common remedies: use more flexible models (e.g., polynomial functions).

Probabilistic calibration

Say η̂ is (probabilistically) calibrated if
  P(Y = +1 | η̂(X) = p) = p  for all p ∈ [0, 1].

E.g., it rains on half of the days on which you say "50% chance of rain".

Note: this is achieved by a constant function, η̂(x) = P(Y = +1), x ∈ X. So calibration is weaker than approximating the conditional probability function.

Useful for interpretability.

To (approximately) achieve calibration, we can post-process the output of a scoring function h.

Achieving calibration via post-processing

Use (validation) data ((x_i, y_i))_{i=1}^n from X × {±1} to learn g: R → [0, 1] so that η̂ := g ∘ h is (approximately) calibrated.

Option #1: Use a flexible model like Nadaraya-Watson regression to minimize a suitable loss (e.g., squared loss). E.g.,
  g(z) := Σ_{i=1}^n 1{y_i = +1}·K((z − h(x_i))/σ)  /  Σ_{i=1}^n K((z − h(x_i))/σ),
where K is a smoothing kernel, like K(δ) = exp(−δ^2/2), and σ > 0 is the bandwidth.

Option #2: Use isotonic regression, which finds a monotone increasing g.

[Plot: the labels 1{y_i = +1} and the fitted values g(h(x_i)) plotted against h(x).]

And many others...
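
A small numpy sketch (ours) of Option #1: the Nadaraya-Watson map g(z) from the slide with a Gaussian kernel, applied to validation scores h(x_i).

```python
import numpy as np

def nadaraya_watson_calibrator(scores, labels, sigma=0.5):
    """g(z) = sum_i 1{y_i=+1} K((z - h(x_i))/sigma) / sum_i K((z - h(x_i))/sigma)."""
    pos = (labels == 1).astype(float)
    def g(z):
        z = np.atleast_1d(np.asarray(z, dtype=float))
        diffs = (z[:, None] - scores[None, :]) / sigma
        K = np.exp(-0.5 * diffs ** 2)                 # Gaussian smoothing kernel
        return (K @ pos) / K.sum(axis=1)
    return g

# usage on synthetic validation data where the true calibration map is the sigmoid
rng = np.random.default_rng(6)
scores = 2.0 * rng.normal(size=2000)                  # h(x_i) on validation points
p_true = 1.0 / (1.0 + np.exp(-scores))
labels = np.where(rng.random(2000) < p_true, 1, -1)
g = nadaraya_watson_calibrator(scores, labels, sigma=0.3)
print(g(np.array([-2.0, 0.0, 2.0])))                  # roughly [0.12, 0.5, 0.88]
```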

4. Multiclass classification

Multiclass classification

Often we have more than two classes: Y = {1, ..., K} (multiclass classification).

Logistic regression generalizes to multinomial logistic regression, where
  P(Y = k | X = x) = exp(x^T β_k) / Σ_{k'=1}^K exp(x^T β_{k'}),  x ∈ R^d.
MLE here is similar to MLE in the binary case.

Perceptron, SVM: many extensions for multiclass.

Neural networks: often formulated like multinomial logistic regression.

Can we generically reduce multiclass classification to binary classification?
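
A short sketch (ours) of the multinomial logistic (softmax) probabilities above, computed in a numerically stable way.

```python
import numpy as np

def softmax_probs(x, B):
    """P(Y = k | X = x) = exp(x^T beta_k) / sum_k' exp(x^T beta_k').

    x: (d,) feature vector; B: (d, K) matrix whose k-th column is beta_k.
    """
    scores = x @ B                       # x^T beta_k for each class k
    scores = scores - scores.max()       # subtracting the max does not change the ratios
    e = np.exp(scores)
    return e / e.sum()

# toy usage with d = 3 features and K = 4 classes
rng = np.random.default_rng(7)
B = rng.normal(size=(3, 4))
x = rng.normal(size=3)
p = softmax_probs(x, B)
print(p, p.sum())                        # probabilities over the K classes, summing to 1
```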

One-against-all (of the rest) reduction

One-against-all training
input: Labeled examples ((x_i, y_i))_{i=1}^n from X × {1, ..., K}; a learning algorithm A that takes data from X × {±1} and returns a scoring function h: X → R.
output: Scoring functions ĥ_1, ..., ĥ_K: X → R.
1: for k = 1, ..., K do
2:   Create the data set S_k = ((x_i, y_i^(k)))_{i=1}^n where y_i^(k) = +1 if y_i = k, and −1 otherwise.
3:   Run A on S_k to obtain a scoring function ĥ_k: X → R.
4: end for
5: return scoring functions ĥ_1, ..., ĥ_K.

One-against-all prediction
input: Scoring functions ĥ_1, ..., ĥ_K: X → R; a new point x ∈ X.
output: Predicted label ŷ ∈ {1, ..., K}.
1: return ŷ ∈ arg max_{k ∈ {1,...,K}} ĥ_k(x).
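
A compact Python rendering (ours) of this pseudocode, parameterized by any binary learner A that maps (X, y ∈ {±1}) to a scoring function; the least-squares learner below is purely for illustration.

```python
import numpy as np

def one_against_all_train(X, y, K, A):
    """Run the binary learner A on each "class k vs. rest" problem; return [h_1, ..., h_K]."""
    scorers = []
    for k in range(1, K + 1):
        y_k = np.where(y == k, 1.0, -1.0)        # relabel: +1 if y_i = k, -1 otherwise
        scorers.append(A(X, y_k))
    return scorers

def one_against_all_predict(scorers, X):
    """Predict argmax_k h_k(x), with labels 1..K."""
    scores = np.column_stack([h(X) for h in scorers])
    return 1 + np.argmax(scores, axis=1)

# an example binary learner A: affine least-squares scoring function
def least_squares_learner(X, y_pm):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda Xnew: np.hstack([Xnew, np.ones((Xnew.shape[0], 1))]) @ w

# toy usage: three well-separated Gaussian classes in R^2
rng = np.random.default_rng(8)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in means])
y = np.repeat([1, 2, 3], 200)
scorers = one_against_all_train(X, y, K=3, A=least_squares_learner)
print("training accuracy:", np.mean(one_against_all_predict(scorers, X) == y))
```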

When does one-against-all work well?

Main idea: Suppose every ĥ_k corresponds to a good estimate η̂_k of the conditional probability function, i.e., η̂_k(x) ≈ P(Y = k | X = x) on average. (E.g., η̂_k(x) = (1 + ĥ_k(x))/2 for all k = 1, ..., K.)

Then the behavior of the one-against-all classifier f̂ is similar to that of the Bayes optimal classifier!

Masking

Suppose X = R, Y = {1, 2, 3}, and P_{Y|X=x} is piecewise constant.

[Plot: the three conditional probability functions P(Y = k | X = x), each an indicator of an interval.]

E.g., P(Y = 1 | X = x) = 1{x ∈ [0.1, 0.9]}, P(Y = 2 | X = x) = 1{x ∈ [1.1, 1.9]}, P(Y = 3 | X = x) = 1{x ∈ [2.1, 2.9]}.

Try to fit P(Y = k | X = x) using affine functions of x, e.g., η̂_k(x) = m_k·x + b_k for some m_k, b_k ∈ R.

Masking (continued)

[Plot: the three conditional probability functions P(Y = k | X = x) and the best affine approximation to each.]

The arg max_k η̂_k(x) is never 2, so the classifier never predicts label 2.

How to fix this?
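
A short simulation (ours) reproducing the masking phenomenon: fit the best affine approximation to each indicator 1{y = k} by least squares, then check how often arg max_k η̂_k(x) equals 2.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60_000

# X uniform over [0.1, 0.9] u [1.1, 1.9] u [2.1, 2.9]; Y = 1, 2, 3 on the three intervals
k = rng.integers(1, 4, size=n)                       # class label 1, 2, or 3
x = rng.uniform(0.1, 0.9, size=n) + (k - 1)          # shift into the k-th interval
Xb = np.column_stack([x, np.ones(n)])                # affine features [x, 1]

# best affine fit eta_hat_k(x) = m_k x + b_k to the indicator 1{y = k}, by least squares
coef = np.column_stack([np.linalg.lstsq(Xb, (k == c).astype(float), rcond=None)[0]
                        for c in (1, 2, 3)])
eta_hat = Xb @ coef                                  # n x 3 matrix of fitted values
pred = 1 + np.argmax(eta_hat, axis=1)

print("fraction predicted as each class:", [round(np.mean(pred == c), 4) for c in (1, 2, 3)])
# class 2 is (essentially) never the argmax: it is "masked" by classes 1 and 3
```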

Key takeaways

1. Scoring functions and thresholds; alternative performance criteria.
2. Eliciting conditional probabilities with loss functions.
3. Calibration property.
4. One-against-all approach to multiclass classification; masking.