Classification objectives COMS 4771


1. Recap: binary classification

Scoring functions

Consider binary classification problems with Y = {-1, +1}.

WLOG, we can write a classifier as

    x -> +1 if h(x) > 0,  and  -1 if h(x) <= 0,

for some scoring function h: X -> R.

Some examples:
- For linear classifiers: h(x) = x^T w.
- For the Bayes optimal classifier: h(x) = P(Y = +1 | X = x) - 1/2.

Expected zero-one loss is E[ l_0/1(Y h(X)) ], where l_0/1(z) = 1{z <= 0}.

Zero-one loss and some surrogate losses

[Figure: the zero-one loss and the surrogate losses below, plotted as functions of the margin z.]

Some surrogate losses that upper-bound l_0/1:
- hinge:        l_hinge(z) = [1 - z]_+
- squared:      l_sq(z) = (1 - z)^2
- logistic:     l_logistic(z) = log_2(1 + e^{-z})
- exponential:  l_exp(z) = e^{-z}

Note: when y ∈ {±1} and p ∈ R, l_sq(yp) = (1 - yp)^2 = (y - p)^2.
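As a quick sanity check on these definitions, here is a minimal NumPy sketch (the function names are ours, not from the slides) that evaluates each surrogate on a grid of margins z and confirms it upper-bounds the zero-one loss there.

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)      # l_0/1(z) = 1{z <= 0}
def hinge(z):       return np.maximum(0.0, 1.0 - z)    # [1 - z]_+
def squared(z):     return (1.0 - z) ** 2              # (1 - z)^2
def logistic(z):    return np.log2(1.0 + np.exp(-z))   # log_2(1 + e^{-z})
def exponential(z): return np.exp(-z)                  # e^{-z}

z = np.linspace(-3, 3, 601)
for name, loss in [("hinge", hinge), ("squared", squared),
                   ("logistic", logistic), ("exponential", exponential)]:
    # each surrogate dominates the zero-one loss pointwise
    assert np.all(loss(z) >= zero_one(z) - 1e-12), name
print("all four surrogates upper-bound the zero-one loss on the grid")
```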

Beyond binary classification

- What if different types of mistakes have different costs?
- What if we want the (conditional) probability of a positive label?
- What if there are more than two classes?

2. Cost-sensitive classification

Cost-sensitive classification

Often have different costs for different kinds of mistakes. For c ∈ [0, 1]:

                ŷ = -1    ŷ = +1
    y = -1        0         c
    y = +1      1 - c       0

(Why can we restrict attention to c ∈ [0, 1]?)

Cost-sensitive zero-one loss:
    l^(c)_0/1(y, p) = ( 1{y = +1} (1 - c) + 1{y = -1} c ) · l_0/1(yp).

More generally, the cost-sensitive l-loss (for a loss l: R -> R):
    l^(c)(y, p) = ( 1{y = +1} (1 - c) + 1{y = -1} c ) · l(yp).

Fact: if l is convex, then so is l^(c)(y, ·) for each y ∈ {±1}.

Cost-sensitive l-risk and empirical l-risk of a scoring function h: X -> R:
    R^(c)(h) := E[ l^(c)(Y, h(X)) ]                       (our actual objective)
    R̂^(c)(h) := (1/n) Σ_{i=1}^n l^(c)(Y_i, h(X_i))        (what we can try to minimize)
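A minimal sketch of the cost-sensitive empirical l-risk above, assuming labels in {-1, +1} and precomputed scores h(x_i); the helper name and the choice of the logistic surrogate as the base loss l are ours.

```python
import numpy as np

def cost_sensitive_empirical_risk(y, scores, c, base_loss):
    """(1/n) * sum_i ( 1{y_i=+1}(1-c) + 1{y_i=-1} c ) * l(y_i * h(x_i))."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    weights = np.where(y == 1, 1.0 - c, c)   # 1-c on positives, c on negatives
    return np.mean(weights * base_loss(y * scores))

# toy usage with the logistic surrogate as the base loss
logistic = lambda z: np.log2(1.0 + np.exp(-z))
y = np.array([+1, -1, +1, -1])
scores = np.array([0.8, -1.2, -0.3, 0.5])
print(cost_sensitive_empirical_risk(y, scores, c=0.2, base_loss=logistic))
```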

Minimizer of cost-sensitive risk

What is the optimal classifier for the cost-sensitive (zero-one loss) risk?

Let η(x) := P(Y = +1 | X = x) for x ∈ X.

Conditional on X = x, the conditional cost-sensitive risk of a prediction ŷ,
    E[ l^(c)_0/1(Y, ŷ) | X = x ] = η(x) (1 - c) 1{ŷ = -1} + (1 - η(x)) c 1{ŷ = +1},
is minimized by
    ŷ := +1 if η(x) (1 - c) > (1 - η(x)) c, i.e., if η(x) > c; and -1 otherwise.

Therefore, the classifier based on the scoring function
    h(x) := η(x) - c,   x ∈ X,
has the smallest cost-sensitive risk R^(c).

But where does c come from?
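A small sketch of the resulting plug-in rule, assuming an estimate eta_hat of η is already available (the estimate used here is made up purely for illustration): predict +1 exactly when η̂(x) > c.

```python
import numpy as np

def plug_in_classifier(eta_hat, c):
    """Classifier induced by the scoring function eta_hat(x) - c:
    predict +1 when the estimated P(Y=+1 | X=x) exceeds c, else -1."""
    def classify(x):
        return np.where(np.asarray(eta_hat(x)) > c, +1, -1)
    return classify

# toy usage: a made-up eta_hat, and c = 0.3 (false positives are relatively cheap)
eta_hat = lambda x: 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
f = plug_in_classifier(eta_hat, c=0.3)
print(f([-2.0, 0.0, 1.5]))   # -> [-1  1  1]
```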

Common performance criteria

- Precision: P(Y = +1 | h(X) > 0).
- Recall (a.k.a. True Positive Rate): P(h(X) > 0 | Y = +1). Same as 1 - False Negative Rate.
- False Positive Rate: P(h(X) > 0 | Y = -1). Same as 1 - True Negative Rate.

Example: balanced error rate

Suppose you care about the Balanced Error Rate (BER):
    BER := (1/2) · False Negative Rate + (1/2) · False Positive Rate.
Which cost-sensitive risk should you (try to) minimize?

    2 · BER = P(h(X) <= 0 | Y = +1) + P(h(X) > 0 | Y = -1)            [FNR + FPR]
            = (1/π) · P(h(X) <= 0, Y = +1) + (1/(1-π)) · P(h(X) > 0, Y = -1),

where π := P(Y = +1).

So use R^(c) with
    c := (1/(1-π)) / ( 1/(1-π) + 1/π ) = π = P(Y = +1),
which you can estimate.
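A numeric sketch of why this choice of c works, on made-up data (all names are ours): with c set to the empirical positive rate π̂, the cost-sensitive empirical risk equals 2π̂(1-π̂) times the empirical BER, so minimizing one minimizes the other.

```python
import numpy as np

def empirical_ber(y, yhat):
    """BER = (FNR + FPR) / 2 for labels and predictions in {-1, +1}."""
    fnr = np.mean(yhat[y == +1] == -1)
    fpr = np.mean(yhat[y == -1] == +1)
    return 0.5 * (fnr + fpr)

def empirical_cost_sensitive_risk(y, yhat, c):
    """Average cost: 1-c per false negative, c per false positive."""
    cost = np.where((y == +1) & (yhat == -1), 1.0 - c,
                    np.where((y == -1) & (yhat == +1), c, 0.0))
    return np.mean(cost)

rng = np.random.default_rng(0)
y = rng.choice([-1, +1], size=1000, p=[0.8, 0.2])
yhat = np.where(rng.random(1000) < 0.7, y, -y)       # a noisy made-up predictor
pi = np.mean(y == +1)                                # empirical P(Y = +1)
print(empirical_cost_sensitive_risk(y, yhat, c=pi))  # equals ...
print(2 * pi * (1 - pi) * empirical_ber(y, yhat))    # ... this, exactly
```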

Even more general setting

Distribution over importance-weighted examples: every example comes with an example-specific weight,
    (X, Y, W) ~ P,
where W is the (non-negative) importance weight of the example.

Importance-weighted l-risk of h is R(h) := E[ W · l(Y h(X)) ].

(This comes up in boosting, online decision-making, etc.)

Estimate using an importance-weighted sample (X_1, Y_1, W_1), ..., (X_n, Y_n, W_n):
    R̂(h) := (1/n) Σ_{i=1}^n W_i · l(Y_i h(X_i)).
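A minimal sketch of the importance-weighted empirical risk R̂(h), assuming labels in {-1, +1} and precomputed scores; the hinge surrogate and the toy numbers are ours.

```python
import numpy as np

def weighted_empirical_risk(weights, y, scores, loss):
    """(1/n) * sum_i W_i * l(Y_i * h(X_i)) over an importance-weighted sample."""
    weights = np.asarray(weights, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.mean(weights * loss(y * scores))

# toy usage with the hinge surrogate
hinge = lambda z: np.maximum(0.0, 1.0 - z)
print(weighted_empirical_risk(weights=[1.0, 3.0, 0.5],
                              y=[+1, -1, +1],
                              scores=[0.2, -0.4, -1.0],
                              loss=hinge))
```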

Example: importance weights

Suppose you have a few ways of collecting training data:

Type 1: randomly draw (X, Y) ~ P. (E.g., all e-mails.)

Type 2 (using a filter φ: X -> {0, 1}): draw (X, Y) ~ P; return (X, Y) if φ(X) = 1, repeat otherwise.
(E.g., e-mails that made it past the spam filter.)
Can we estimate E_{(X,Y)~P}[ l(Y h(X)) ] using only examples of Type 2?

Type 3 (using a probabilistic filter q: X -> (0, 1)): draw (X, Y) ~ P; toss a coin B with heads probability q(X); return (X, Y) if B is heads, repeat otherwise.
(E.g., e-mails that made it past the randomized filter.)
Can we estimate E_{(X,Y)~P}[ l(Y h(X)) ] using only examples of Type 3?
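These questions are left open on the slide; one standard answer for the Type 3 case (our suggestion, not stated above) is inverse-propensity weighting: reweight each surviving example by 1/q(x), which connects back to the importance-weighted risk of the previous slide. A hedged simulation sketch with a made-up population and filter q:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up population (X, Y) ~ P and a probabilistic keep-filter q(x) in (0, 1)
x = rng.normal(size=100_000)
y = np.where(rng.random(x.size) < 1.0 / (1.0 + np.exp(-2.0 * x)), +1, -1)
q = lambda x: 0.1 + 0.8 / (1.0 + np.exp(-x))

h = lambda x: 1.5 * x                           # some fixed scoring function
loss = lambda z: np.maximum(0.0, 1.0 - z)       # hinge

kept = rng.random(x.size) < q(x)                # Type 3 sampling: keep w.p. q(x)
xs, ys = x[kept], y[kept]

plain = np.mean(loss(ys * h(xs)))               # biased toward high-q(x) regions
ipw = np.sum(loss(ys * h(xs)) / q(xs)) / np.sum(1.0 / q(xs))  # self-normalized IPW
true = np.mean(loss(y * h(x)))                  # target: risk under all of P
print(true, ipw, plain)                         # ipw tracks true; plain does not
```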

3. Conditional probability estimation

Why conditional probabilities?

Having the conditional probability function η permits probabilistic reasoning.

Example: Suppose I have a classifier f: X -> {-1, +1}. False positives cost c, false negatives cost 1 - c.
Given x, how do we estimate the expected cost of the prediction f(x)?

Expected cost:
    (1 - c) · P(Y = +1 | X = x)   if f(x) = -1,
    c · P(Y = -1 | X = x)         if f(x) = +1.

To estimate the expected cost, it suffices to estimate η(x) = P(Y = +1 | X = x).

Eliciting conditional probabilities

Can we estimate η by minimizing a particular (empirical) risk?

Squared loss: l_sq(z) = (1 - z)^2.

    E[ l_sq(Y h(X)) | X = x ] = η(x) (1 - h(x))^2 + (1 - η(x)) (1 + h(x))^2.

Using calculus, we can see that this is minimized by h such that h(x) = 2η(x) - 1, i.e., η(x) = (1 + h(x))/2.

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X -> R to (approximately) minimize the (empirical) squared loss risk.
2. Construct the estimate η̂ via η̂(x) := (1 + ĥ(x))/2, x ∈ X.
   (May want to clip the output to the range [0, 1].)
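A minimal sketch of this recipe with a linear scoring function fit by ordinary least squares on the ±1 labels (recall l_sq(yp) = (y - p)^2, so this minimizes the empirical squared-loss risk over linear h); the helper names and toy data are ours.

```python
import numpy as np

def fit_squared_loss_scorer(X, y):
    """Least-squares fit of h(x) = x^T w + b to labels in {-1, +1}; this
    minimizes the empirical squared loss (1/n) sum_i (y_i - h(x_i))^2."""
    Xb = np.column_stack([X, np.ones(len(X))])          # append intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

def eta_hat_from_scorer(h):
    """Plug-in estimate: eta_hat(x) = (1 + h(x)) / 2, clipped to [0, 1]."""
    return lambda Xn: np.clip((1.0 + h(Xn)) / 2.0, 0.0, 1.0)

# toy usage on made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0])), 1.0, -1.0)
h_hat = fit_squared_loss_scorer(X, y)
eta_hat = eta_hat_from_scorer(h_hat)
print(eta_hat(X[:3]))
```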

Eliciting conditional probabilities (again)

Logistic loss: l_logistic(z) = log_2(1 + e^{-z}).

    E[ l_logistic(Y h(X)) | X = x ] = η(x) log_2(1 + e^{-h(x)}) + (1 - η(x)) log_2(1 + e^{h(x)}).

This is minimized by h such that logistic(h(x)) = η(x), where logistic(z) = 1/(1 + e^{-z}).

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X -> R to (approximately) minimize the (empirical) logistic loss risk.
2. Construct the estimate η̂ via η̂(x) := logistic(ĥ(x)), x ∈ X.

This works for many other losses.
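The analogous sketch for the logistic recipe, assuming some scoring function ĥ has already been fit by minimizing logistic loss; here the scorer is a made-up linear function, and only the link differs from the squared-loss recipe.

```python
import numpy as np

def logistic_link(z):
    """logistic(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def eta_hat_from_logistic_scorer(h):
    """Plug-in estimate for a logistic-loss-minimizing scorer: eta_hat = logistic(h)."""
    return lambda Xn: logistic_link(h(Xn))

# toy usage with a made-up linear scorer (pretend it came from logistic regression)
w = np.array([1.0, -2.0])
h_hat = lambda Xn: np.asarray(Xn) @ w
eta_hat = eta_hat_from_logistic_scorer(h_hat)
print(eta_hat(np.array([[0.5, 0.1], [-1.0, 0.3]])))
```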

Deficiency of hinge loss

But not the hinge loss: l_hinge(z) = max{0, 1 - z}.

E[ l_hinge(Y h(X)) | X = x ] is minimized by h such that h(x) = sign(2η(x) - 1).

[Figure: the conditional hinge loss risk as a function of h(x), for η(x) = 1/4, 1/3, 2/3, 3/4.]

Cannot recover η(x) from the minimizing h(x).
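A small numeric check of this claim (our own grid search, not from the slides): for several values of η(x), the conditional hinge risk is minimized at h(x) = ±1, so distinct values of η(x) on the same side of 1/2 give the same minimizer.

```python
import numpy as np

def conditional_hinge_risk(eta, h):
    """E[ l_hinge(Y h) | X = x ] = eta * [1 - h]_+ + (1 - eta) * [1 + h]_+."""
    return eta * np.maximum(0.0, 1.0 - h) + (1.0 - eta) * np.maximum(0.0, 1.0 + h)

h_grid = np.linspace(-2.0, 2.0, 4001)
for eta in (0.25, 1/3, 2/3, 0.75):
    best_h = h_grid[np.argmin(conditional_hinge_risk(eta, h_grid))]
    print(f"eta(x) = {eta:.2f}: minimizing h(x) = {best_h:+.2f}")
# eta = 0.25 and 1/3 both give h = -1; eta = 2/3 and 0.75 both give h = +1.
```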

Difficulty of estimating conditional probabilities

Accurately estimating conditional probabilities is harder than learning a good classifier.

Example: In general,
    min_{w ∈ R^d} E[ l_sq(Y X^T w) ]
can be much larger than
    min_{h: R^d -> R} E[ l_sq(Y h(X)) ].
The optimal scoring function x -> 2η(x) - 1 is not necessarily well-approximated by a linear function.
(We'll see an example of this later.)

Common remedies: use more flexible models (e.g., polynomial functions).

Probabilistic calibration

Say η̂ is (probabilistically) calibrated if
    P(Y = +1 | η̂(X) = p) = p,   for all p ∈ [0, 1].

E.g., it rains on half of the days on which you say "50% chance of rain".

Note: this is achieved by a constant function, η̂(x) = P(Y = +1) for all x ∈ X. So calibration is weaker than approximating the conditional probability function.

Useful for interpretability.

To (approximately) achieve calibration, can use post-processing on the output of the scoring function h.

Achieving calibration via post-processing

Use (validation) data ((x_i, y_i))_{i=1}^n from X × {±1} to learn g: R -> [0, 1] so that η̂ := g ∘ h is (approximately) calibrated.

Option #1: Use a flexible model like Nadaraya-Watson regression to minimize a suitable loss (e.g., squared loss). E.g.,
    g(z) := [ Σ_{i=1}^n 1{y_i = +1} K((z - h(x_i))/σ) ] / [ Σ_{i=1}^n K((z - h(x_i))/σ) ],
where K is a smoothing kernel, like K(δ) = exp(-δ^2/2), and σ > 0 is the bandwidth.

Option #2: Isotonic regression, which finds a monotone increasing g.
[Figure: isotonic regression fit g(h(x_i)) to the indicators 1{y_i = +1}, plotted against h(x).]

And many others.
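A minimal sketch of Option #1, implementing the Nadaraya-Watson map g above with the Gaussian kernel K(δ) = exp(-δ²/2); the validation data, bandwidth, and function names are made up for illustration.

```python
import numpy as np

def nadaraya_watson_calibrator(scores_val, y_val, sigma=0.5):
    """g(z) = sum_i 1{y_i=+1} K((z - h(x_i))/sigma) / sum_i K((z - h(x_i))/sigma),
    with the Gaussian kernel K(d) = exp(-d^2 / 2)."""
    scores_val = np.asarray(scores_val, dtype=float)
    pos = (np.asarray(y_val) == 1).astype(float)

    def g(z):
        d = (np.atleast_1d(z)[:, None] - scores_val[None, :]) / sigma
        K = np.exp(-0.5 * d ** 2)
        return (K @ pos) / K.sum(axis=1)

    return g

# toy usage: validation scores h(x_i) with labels, then calibrate new scores
rng = np.random.default_rng(0)
scores_val = rng.normal(size=200)
y_val = np.where(rng.random(200) < 1.0 / (1.0 + np.exp(-scores_val)), +1, -1)
g = nadaraya_watson_calibrator(scores_val, y_val, sigma=0.3)
print(g(np.array([-2.0, 0.0, 2.0])))   # eta_hat = g(h(x)) for new points
```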

4. Multiclass classification

Multiclass classification

Often have more than two classes: Y = {1, ..., K} (multiclass classification).

Logistic regression: generalizes to multinomial logistic regression, where
    P(Y = k | X = x) = exp(x^T β_k) / Σ_{k'=1}^K exp(x^T β_{k'}),   x ∈ R^d.
MLE here is similar to MLE in the binary case.

Perceptron, SVM: many extensions for multiclass.

Neural networks: often formulated like multinomial logistic regression.

Can we generically reduce multiclass classification to binary classification?
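A minimal sketch of the multinomial logistic (softmax) probabilities in the displayed formula, with a standard max-shift for numerical stability; the matrix B stacking β_1, ..., β_K as columns is our notation.

```python
import numpy as np

def softmax_probs(X, B):
    """P(Y = k | X = x) = exp(x^T beta_k) / sum_{k'} exp(x^T beta_{k'}).
    X: (n, d) inputs; B: (d, K) matrix whose columns are beta_1, ..., beta_K."""
    Z = X @ B                                   # (n, K) scores x^T beta_k
    Z = Z - Z.max(axis=1, keepdims=True)        # shift for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# toy usage: 2 inputs in R^3, K = 4 classes with made-up parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
P = softmax_probs(X, B)
print(P, P.sum(axis=1))                         # each row sums to 1
```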

One-against-all (of the rest) reduction

One-against-all training
input: Labeled examples ((x_i, y_i))_{i=1}^n from X × {1, ..., K}; a learning algorithm A that takes data from X × {±1} and returns a scoring function h: X -> R.
output: Scoring functions ĥ_1, ..., ĥ_K: X -> R.
1: for k = 1, ..., K do
2:   Create the data set S_k = ((x_i, y_i^(k)))_{i=1}^n, where y_i^(k) = +1 if y_i = k, and -1 otherwise.
3:   Run A on S_k to obtain a scoring function ĥ_k: X -> R.
4: end for
5: return Scoring functions ĥ_1, ..., ĥ_K.

One-against-all prediction
input: Scoring functions ĥ_1, ..., ĥ_K: X -> R; a new point x ∈ X.
output: Predicted label ŷ ∈ {1, ..., K}.
1: return ŷ ∈ arg max_{k ∈ {1,...,K}} ĥ_k(x).
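A hedged Python rendering of the two boxes above, assuming the binary learner A is passed in as a function mapping (X, y ∈ {±1}) to a scoring function; the least-squares learner and toy data are stand-ins.

```python
import numpy as np

def one_against_all_train(X, y, K, binary_learner):
    """Train K scorers, class k vs. the rest (training box above)."""
    scorers = []
    for k in range(1, K + 1):
        y_k = np.where(np.asarray(y) == k, 1.0, -1.0)   # relabel: +1 iff y_i = k
        scorers.append(binary_learner(X, y_k))           # run A on S_k
    return scorers

def one_against_all_predict(scorers, X_new):
    """Predict argmax_k h_k(x), returning labels in {1, ..., K} (prediction box)."""
    scores = np.column_stack([h(X_new) for h in scorers])
    return 1 + np.argmax(scores, axis=1)

def least_squares_learner(X, y_pm):
    """Stand-in binary learner A: least squares on the +/-1 labels."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

# toy usage: three made-up clusters, one per class
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y = rng.integers(1, 4, size=300)
X = means[y - 1] + rng.normal(scale=0.5, size=(300, 2))
scorers = one_against_all_train(X, y, K=3, binary_learner=least_squares_learner)
print(one_against_all_predict(scorers, X[:5]), y[:5])
```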

When does one-against-all work well?

Main idea: Suppose every ĥ_k corresponds to a good estimate η̂_k of the conditional probability function, η̂_k(x) ≈ P(Y = k | X = x), on average. (E.g., η̂_k(x) = (1 + ĥ_k(x))/2 for all k = 1, ..., K.)

Then the behavior of the one-against-all classifier f̂ is similar to that of the Bayes optimal classifier!

Masking

Suppose X = R, Y = {1, 2, 3}, and P_{Y|X=x} is piece-wise constant. E.g.,
    P(Y = 1 | X = x) = 1{x ∈ [0.1, 0.9]},
    P(Y = 2 | X = x) = 1{x ∈ [1.1, 1.9]},
    P(Y = 3 | X = x) = 1{x ∈ [2.1, 2.9]}.

Try to fit P(Y = k | X = x) using affine functions of x, e.g., η̂_k(x) = m_k x + b_k for some m_k, b_k ∈ R.

Masking (continued)

[Figure: the conditional probabilities P(Y = k | X = x) and the best affine approximations η̂_k to them, for each k.]

The arg max_k η̂_k(x) is never 2, so label 2 is never predicted.

How to fix this?
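A small sketch reproducing the masking effect numerically under the piece-wise constant distribution from the previous slide; the marginal of X and the class priors are made up (chosen so the effect is clean), and the affine fits are ordinary least squares on the class indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
# made-up marginal: class priors 0.4 / 0.2 / 0.4, x uniform on the class's interval
y = rng.choice([1, 2, 3], size=n, p=[0.4, 0.2, 0.4])
lo = np.array([0.1, 1.1, 2.1])
x = lo[y - 1] + 0.8 * rng.random(n)

# best affine fit eta_hat_k(x) = m_k x + b_k to each indicator 1{y = k}
Xb = np.column_stack([x, np.ones(n)])
fits = [np.linalg.lstsq(Xb, (y == k).astype(float), rcond=None)[0] for k in (1, 2, 3)]

grid = np.linspace(0.1, 2.9, 500)
eta_hat = np.column_stack([m * grid + b for (m, b) in fits])
predicted = 1 + np.argmax(eta_hat, axis=1)
print(sorted(set(predicted.tolist())))   # [1, 3]: class 2 never attains the argmax
```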

Key takeaways

1. Scoring functions and thresholds; alternative performance criteria.
2. Eliciting conditional probabilities with loss functions.
3. Calibration property.
4. One-against-all approach to multiclass classification; masking.
