Classification objectives COMS 4771


1. Recap: binary classification

Scoring functions

Consider binary classification problems with Y = {-1, +1}.

WLOG, we can write a classifier as

    x -> +1 if h(x) > 0,  and  -1 if h(x) <= 0,

for some scoring function h: X -> R.

Some examples:
- For linear classifiers: h(x) = x^T w.
- For the Bayes optimal classifier: h(x) = P(Y = +1 | X = x) - 1/2.

Expected zero-one loss is E[ l_0/1(Y h(X)) ], where l_0/1(z) = 1{z <= 0}.

Zero-one loss and some surrogate losses

[Figure: the zero-one loss and the surrogate losses below, plotted as functions of the margin z.]

Some surrogate losses that upper-bound l_0/1:
- hinge:        l_hinge(z) = [1 - z]_+
- squared:      l_sq(z) = (1 - z)^2
- logistic:     l_logistic(z) = log_2(1 + e^{-z})
- exponential:  l_exp(z) = e^{-z}

Note: when y ∈ {±1} and p ∈ R, l_sq(yp) = (1 - yp)^2 = (y - p)^2.
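As a quick sanity check on these definitions, here is a minimal NumPy sketch (the function names are ours, not from the slides) that evaluates each surrogate on a grid of margins z and confirms it upper-bounds the zero-one loss there.

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)      # l_0/1(z) = 1{z <= 0}
def hinge(z):       return np.maximum(0.0, 1.0 - z)    # [1 - z]_+
def squared(z):     return (1.0 - z) ** 2              # (1 - z)^2
def logistic(z):    return np.log2(1.0 + np.exp(-z))   # log_2(1 + e^{-z})
def exponential(z): return np.exp(-z)                  # e^{-z}

z = np.linspace(-3, 3, 601)
for name, loss in [("hinge", hinge), ("squared", squared),
                   ("logistic", logistic), ("exponential", exponential)]:
    # each surrogate dominates the zero-one loss pointwise
    assert np.all(loss(z) >= zero_one(z) - 1e-12), name
print("all four surrogates upper-bound the zero-one loss on the grid")
```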

Beyond binary classification

- What if different types of mistakes have different costs?
- What if we want the (conditional) probability of a positive label?
- What if there are more than two classes?

2. Cost-sensitive classification

Cost-sensitive classification

Often have different costs for different kinds of mistakes. For c ∈ [0, 1]:

                ŷ = -1    ŷ = +1
    y = -1        0         c
    y = +1      1 - c       0

(Why can we restrict attention to c ∈ [0, 1]?)

Cost-sensitive zero-one loss:
    l^(c)_0/1(y, p) = ( 1{y = +1} (1 - c) + 1{y = -1} c ) · l_0/1(yp).

More generally, the cost-sensitive l-loss (for a loss l: R -> R):
    l^(c)(y, p) = ( 1{y = +1} (1 - c) + 1{y = -1} c ) · l(yp).

Fact: if l is convex, then so is l^(c)(y, ·) for each y ∈ {±1}.

Cost-sensitive l-risk and empirical l-risk of a scoring function h: X -> R:
    R^(c)(h) := E[ l^(c)(Y, h(X)) ]                       (our actual objective)
    R̂^(c)(h) := (1/n) Σ_{i=1}^n l^(c)(Y_i, h(X_i))        (what we can try to minimize)
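A minimal sketch of the cost-sensitive empirical l-risk above, assuming labels in {-1, +1} and precomputed scores h(x_i); the helper name and the choice of the logistic surrogate as the base loss l are ours.

```python
import numpy as np

def cost_sensitive_empirical_risk(y, scores, c, base_loss):
    """(1/n) * sum_i ( 1{y_i=+1}(1-c) + 1{y_i=-1} c ) * l(y_i * h(x_i))."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    weights = np.where(y == 1, 1.0 - c, c)   # 1-c on positives, c on negatives
    return np.mean(weights * base_loss(y * scores))

# toy usage with the logistic surrogate as the base loss
logistic = lambda z: np.log2(1.0 + np.exp(-z))
y = np.array([+1, -1, +1, -1])
scores = np.array([0.8, -1.2, -0.3, 0.5])
print(cost_sensitive_empirical_risk(y, scores, c=0.2, base_loss=logistic))
```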

Minimizer of cost-sensitive risk

What is the optimal classifier for the cost-sensitive (zero-one loss) risk?

Let η(x) := P(Y = +1 | X = x) for x ∈ X.

Conditional on X = x, the conditional cost-sensitive risk of a prediction ŷ,
    E[ l^(c)_0/1(Y, ŷ) | X = x ] = η(x) (1 - c) 1{ŷ = -1} + (1 - η(x)) c 1{ŷ = +1},
is minimized by
    ŷ := +1 if η(x) (1 - c) > (1 - η(x)) c, i.e., if η(x) > c; and -1 otherwise.

Therefore, the classifier based on the scoring function
    h(x) := η(x) - c,   x ∈ X,
has the smallest cost-sensitive risk R^(c).

But where does c come from?
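A small sketch of the resulting plug-in rule, assuming an estimate eta_hat of η is already available (the estimate used here is made up purely for illustration): predict +1 exactly when η̂(x) > c.

```python
import numpy as np

def plug_in_classifier(eta_hat, c):
    """Classifier induced by the scoring function eta_hat(x) - c:
    predict +1 when the estimated P(Y=+1 | X=x) exceeds c, else -1."""
    def classify(x):
        return np.where(np.asarray(eta_hat(x)) > c, +1, -1)
    return classify

# toy usage: a made-up eta_hat, and c = 0.3 (false positives are relatively cheap)
eta_hat = lambda x: 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
f = plug_in_classifier(eta_hat, c=0.3)
print(f([-2.0, 0.0, 1.5]))   # -> [-1  1  1]
```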

Common performance criteria

- Precision: P(Y = +1 | h(X) > 0).
- Recall (a.k.a. True Positive Rate): P(h(X) > 0 | Y = +1). Same as 1 - False Negative Rate.
- False Positive Rate: P(h(X) > 0 | Y = -1). Same as 1 - True Negative Rate.

Example: balanced error rate

Suppose you care about the Balanced Error Rate (BER):
    BER := (1/2) · False Negative Rate + (1/2) · False Positive Rate.
Which cost-sensitive risk should you (try to) minimize?

    2 · BER = P(h(X) <= 0 | Y = +1) + P(h(X) > 0 | Y = -1)            [FNR + FPR]
            = (1/π) · P(h(X) <= 0, Y = +1) + (1/(1-π)) · P(h(X) > 0, Y = -1),

where π := P(Y = +1).

So use R^(c) with
    c := (1/(1-π)) / ( 1/(1-π) + 1/π ) = π = P(Y = +1),
which you can estimate.
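A numeric sketch of why this choice of c works, on made-up data (all names are ours): with c set to the empirical positive rate π̂, the cost-sensitive empirical risk equals 2π̂(1-π̂) times the empirical BER, so minimizing one minimizes the other.

```python
import numpy as np

def empirical_ber(y, yhat):
    """BER = (FNR + FPR) / 2 for labels and predictions in {-1, +1}."""
    fnr = np.mean(yhat[y == +1] == -1)
    fpr = np.mean(yhat[y == -1] == +1)
    return 0.5 * (fnr + fpr)

def empirical_cost_sensitive_risk(y, yhat, c):
    """Average cost: 1-c per false negative, c per false positive."""
    cost = np.where((y == +1) & (yhat == -1), 1.0 - c,
                    np.where((y == -1) & (yhat == +1), c, 0.0))
    return np.mean(cost)

rng = np.random.default_rng(0)
y = rng.choice([-1, +1], size=1000, p=[0.8, 0.2])
yhat = np.where(rng.random(1000) < 0.7, y, -y)       # a noisy made-up predictor
pi = np.mean(y == +1)                                # empirical P(Y = +1)
print(empirical_cost_sensitive_risk(y, yhat, c=pi))  # equals ...
print(2 * pi * (1 - pi) * empirical_ber(y, yhat))    # ... this, exactly
```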

Even more general setting

Distribution over importance-weighted examples: every example comes with an example-specific weight,
    (X, Y, W) ~ P,
where W is the (non-negative) importance weight of the example.

Importance-weighted l-risk of h is R(h) := E[ W · l(Y h(X)) ].

(This comes up in boosting, online decision-making, etc.)

Estimate using an importance-weighted sample (X_1, Y_1, W_1), ..., (X_n, Y_n, W_n):
    R̂(h) := (1/n) Σ_{i=1}^n W_i · l(Y_i h(X_i)).
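A minimal sketch of the importance-weighted empirical risk R̂(h), assuming labels in {-1, +1} and precomputed scores; the hinge surrogate and the toy numbers are ours.

```python
import numpy as np

def weighted_empirical_risk(weights, y, scores, loss):
    """(1/n) * sum_i W_i * l(Y_i * h(X_i)) over an importance-weighted sample."""
    weights = np.asarray(weights, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.mean(weights * loss(y * scores))

# toy usage with the hinge surrogate
hinge = lambda z: np.maximum(0.0, 1.0 - z)
print(weighted_empirical_risk(weights=[1.0, 3.0, 0.5],
                              y=[+1, -1, +1],
                              scores=[0.2, -0.4, -1.0],
                              loss=hinge))
```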

Example: importance weights

Suppose you have a few ways of collecting training data:

Type 1: randomly draw (X, Y) ~ P. (E.g., all e-mails.)

Type 2 (using a filter φ: X -> {0, 1}): draw (X, Y) ~ P; return (X, Y) if φ(X) = 1, repeat otherwise.
(E.g., e-mails that made it past the spam filter.)
Can we estimate E_{(X,Y)~P}[ l(Y h(X)) ] using only examples of Type 2?

Type 3 (using a probabilistic filter q: X -> (0, 1)): draw (X, Y) ~ P; toss a coin B with heads probability q(X); return (X, Y) if B is heads, repeat otherwise.
(E.g., e-mails that made it past the randomized filter.)
Can we estimate E_{(X,Y)~P}[ l(Y h(X)) ] using only examples of Type 3?
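These questions are left open on the slide; one standard answer for the Type 3 case (our suggestion, not stated above) is inverse-propensity weighting: reweight each surviving example by 1/q(x), which connects back to the importance-weighted risk of the previous slide. A hedged simulation sketch with a made-up population and filter q:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up population (X, Y) ~ P and a probabilistic keep-filter q(x) in (0, 1)
x = rng.normal(size=100_000)
y = np.where(rng.random(x.size) < 1.0 / (1.0 + np.exp(-2.0 * x)), +1, -1)
q = lambda x: 0.1 + 0.8 / (1.0 + np.exp(-x))

h = lambda x: 1.5 * x                           # some fixed scoring function
loss = lambda z: np.maximum(0.0, 1.0 - z)       # hinge

kept = rng.random(x.size) < q(x)                # Type 3 sampling: keep w.p. q(x)
xs, ys = x[kept], y[kept]

plain = np.mean(loss(ys * h(xs)))               # biased toward high-q(x) regions
ipw = np.sum(loss(ys * h(xs)) / q(xs)) / np.sum(1.0 / q(xs))  # self-normalized IPW
true = np.mean(loss(y * h(x)))                  # target: risk under all of P
print(true, ipw, plain)                         # ipw tracks true; plain does not
```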

3. Conditional probability estimation

Why conditional probabilities?

Having the conditional probability function η permits probabilistic reasoning.

Example: Suppose I have a classifier f: X -> {-1, +1}. False positives cost c, false negatives cost 1 - c.
Given x, how do we estimate the expected cost of the prediction f(x)?

Expected cost:
    (1 - c) · P(Y = +1 | X = x)   if f(x) = -1,
    c · P(Y = -1 | X = x)         if f(x) = +1.

To estimate the expected cost, it suffices to estimate η(x) = P(Y = +1 | X = x).

Eliciting conditional probabilities

Can we estimate η by minimizing a particular (empirical) risk?

Squared loss: l_sq(z) = (1 - z)^2.

    E[ l_sq(Y h(X)) | X = x ] = η(x) (1 - h(x))^2 + (1 - η(x)) (1 + h(x))^2.

Using calculus, we can see that this is minimized by h such that h(x) = 2η(x) - 1, i.e., η(x) = (1 + h(x))/2.

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X -> R to (approximately) minimize the (empirical) squared loss risk.
2. Construct the estimate η̂ via η̂(x) := (1 + ĥ(x))/2, x ∈ X.
   (May want to clip the output to the range [0, 1].)
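A minimal sketch of this recipe with a linear scoring function fit by ordinary least squares on the ±1 labels (recall l_sq(yp) = (y - p)^2, so this minimizes the empirical squared-loss risk over linear h); the helper names and toy data are ours.

```python
import numpy as np

def fit_squared_loss_scorer(X, y):
    """Least-squares fit of h(x) = x^T w + b to labels in {-1, +1}; this
    minimizes the empirical squared loss (1/n) sum_i (y_i - h(x_i))^2."""
    Xb = np.column_stack([X, np.ones(len(X))])          # append intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

def eta_hat_from_scorer(h):
    """Plug-in estimate: eta_hat(x) = (1 + h(x)) / 2, clipped to [0, 1]."""
    return lambda Xn: np.clip((1.0 + h(Xn)) / 2.0, 0.0, 1.0)

# toy usage on made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0])), 1.0, -1.0)
h_hat = fit_squared_loss_scorer(X, y)
eta_hat = eta_hat_from_scorer(h_hat)
print(eta_hat(X[:3]))
```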

Eliciting conditional probabilities (again)

Logistic loss: l_logistic(z) = log_2(1 + e^{-z}).

    E[ l_logistic(Y h(X)) | X = x ] = η(x) log_2(1 + e^{-h(x)}) + (1 - η(x)) log_2(1 + e^{h(x)}).

This is minimized by h such that logistic(h(x)) = η(x), where logistic(z) = 1/(1 + e^{-z}).

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X -> R to (approximately) minimize the (empirical) logistic loss risk.
2. Construct the estimate η̂ via η̂(x) := logistic(ĥ(x)), x ∈ X.

This works for many other losses.
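The analogous sketch for the logistic recipe, assuming some scoring function ĥ has already been fit by minimizing logistic loss; here the scorer is a made-up linear function, and only the link differs from the squared-loss recipe.

```python
import numpy as np

def logistic_link(z):
    """logistic(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def eta_hat_from_logistic_scorer(h):
    """Plug-in estimate for a logistic-loss-minimizing scorer: eta_hat = logistic(h)."""
    return lambda Xn: logistic_link(h(Xn))

# toy usage with a made-up linear scorer (pretend it came from logistic regression)
w = np.array([1.0, -2.0])
h_hat = lambda Xn: np.asarray(Xn) @ w
eta_hat = eta_hat_from_logistic_scorer(h_hat)
print(eta_hat(np.array([[0.5, 0.1], [-1.0, 0.3]])))
```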

Deficiency of hinge loss

But not the hinge loss: l_hinge(z) = max{0, 1 - z}.

E[ l_hinge(Y h(X)) | X = x ] is minimized by h such that h(x) = sign(2η(x) - 1).

[Figure: the conditional hinge loss risk as a function of h(x), for η(x) = 1/4, 1/3, 2/3, 3/4.]

Cannot recover η(x) from the minimizing h(x).
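A small numeric check of this claim (our own grid search, not from the slides): for several values of η(x), the conditional hinge risk is minimized at h(x) = ±1, so distinct values of η(x) on the same side of 1/2 give the same minimizer.

```python
import numpy as np

def conditional_hinge_risk(eta, h):
    """E[ l_hinge(Y h) | X = x ] = eta * [1 - h]_+ + (1 - eta) * [1 + h]_+."""
    return eta * np.maximum(0.0, 1.0 - h) + (1.0 - eta) * np.maximum(0.0, 1.0 + h)

h_grid = np.linspace(-2.0, 2.0, 4001)
for eta in (0.25, 1/3, 2/3, 0.75):
    best_h = h_grid[np.argmin(conditional_hinge_risk(eta, h_grid))]
    print(f"eta(x) = {eta:.2f}: minimizing h(x) = {best_h:+.2f}")
# eta = 0.25 and 1/3 both give h = -1; eta = 2/3 and 0.75 both give h = +1.
```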

Difficulty of estimating conditional probabilities

Accurately estimating conditional probabilities is harder than learning a good classifier.

Example: In general,
    min_{w ∈ R^d} E[ l_sq(Y X^T w) ]
can be much larger than
    min_{h: R^d -> R} E[ l_sq(Y h(X)) ].
The optimal scoring function x -> 2η(x) - 1 is not necessarily well-approximated by a linear function.
(We'll see an example of this later.)

Common remedies: use more flexible models (e.g., polynomial functions).

Probabilistic calibration

Say η̂ is (probabilistically) calibrated if
    P(Y = +1 | η̂(X) = p) = p,   for all p ∈ [0, 1].

E.g., it rains on half of the days on which you say "50% chance of rain".

Note: this is achieved by a constant function, η̂(x) = P(Y = +1) for all x ∈ X. So calibration is weaker than approximating the conditional probability function.

Useful for interpretability.

To (approximately) achieve calibration, can use post-processing on the output of the scoring function h.

Achieving calibration via post-processing

Use (validation) data ((x_i, y_i))_{i=1}^n from X × {±1} to learn g: R -> [0, 1] so that η̂ := g ∘ h is (approximately) calibrated.

Option #1: Use a flexible model like Nadaraya-Watson regression to minimize a suitable loss (e.g., squared loss). E.g.,
    g(z) := [ Σ_{i=1}^n 1{y_i = +1} K((z - h(x_i))/σ) ] / [ Σ_{i=1}^n K((z - h(x_i))/σ) ],
where K is a smoothing kernel, like K(δ) = exp(-δ^2/2), and σ > 0 is the bandwidth.

Option #2: Isotonic regression, which finds a monotone increasing g.
[Figure: isotonic regression fit g(h(x_i)) to the indicators 1{y_i = +1}, plotted against h(x).]

And many others.
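A minimal sketch of Option #1, implementing the Nadaraya-Watson map g above with the Gaussian kernel K(δ) = exp(-δ²/2); the validation data, bandwidth, and function names are made up for illustration.

```python
import numpy as np

def nadaraya_watson_calibrator(scores_val, y_val, sigma=0.5):
    """g(z) = sum_i 1{y_i=+1} K((z - h(x_i))/sigma) / sum_i K((z - h(x_i))/sigma),
    with the Gaussian kernel K(d) = exp(-d^2 / 2)."""
    scores_val = np.asarray(scores_val, dtype=float)
    pos = (np.asarray(y_val) == 1).astype(float)

    def g(z):
        d = (np.atleast_1d(z)[:, None] - scores_val[None, :]) / sigma
        K = np.exp(-0.5 * d ** 2)
        return (K @ pos) / K.sum(axis=1)

    return g

# toy usage: validation scores h(x_i) with labels, then calibrate new scores
rng = np.random.default_rng(0)
scores_val = rng.normal(size=200)
y_val = np.where(rng.random(200) < 1.0 / (1.0 + np.exp(-scores_val)), +1, -1)
g = nadaraya_watson_calibrator(scores_val, y_val, sigma=0.3)
print(g(np.array([-2.0, 0.0, 2.0])))   # eta_hat = g(h(x)) for new points
```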

4. Multiclass classification

Multiclass classification

Often have more than two classes: Y = {1, ..., K} (multiclass classification).

Logistic regression: generalizes to multinomial logistic regression, where
    P(Y = k | X = x) = exp(x^T β_k) / Σ_{k'=1}^K exp(x^T β_{k'}),   x ∈ R^d.
MLE here is similar to MLE in the binary case.

Perceptron, SVM: many extensions for multiclass.

Neural networks: often formulated like multinomial logistic regression.

Can we generically reduce multiclass classification to binary classification?
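A minimal sketch of the multinomial logistic (softmax) probabilities in the displayed formula, with a standard max-shift for numerical stability; the matrix B stacking β_1, ..., β_K as columns is our notation.

```python
import numpy as np

def softmax_probs(X, B):
    """P(Y = k | X = x) = exp(x^T beta_k) / sum_{k'} exp(x^T beta_{k'}).
    X: (n, d) inputs; B: (d, K) matrix whose columns are beta_1, ..., beta_K."""
    Z = X @ B                                   # (n, K) scores x^T beta_k
    Z = Z - Z.max(axis=1, keepdims=True)        # shift for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# toy usage: 2 inputs in R^3, K = 4 classes with made-up parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
P = softmax_probs(X, B)
print(P, P.sum(axis=1))                         # each row sums to 1
```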

One-against-all (of the rest) reduction

One-against-all training
input: Labeled examples ((x_i, y_i))_{i=1}^n from X × {1, ..., K}; a learning algorithm A that takes data from X × {±1} and returns a scoring function h: X -> R.
output: Scoring functions ĥ_1, ..., ĥ_K: X -> R.
1: for k = 1, ..., K do
2:   Create the data set S_k = ((x_i, y_i^(k)))_{i=1}^n, where y_i^(k) = +1 if y_i = k, and -1 otherwise.
3:   Run A on S_k to obtain a scoring function ĥ_k: X -> R.
4: end for
5: return Scoring functions ĥ_1, ..., ĥ_K.

One-against-all prediction
input: Scoring functions ĥ_1, ..., ĥ_K: X -> R; a new point x ∈ X.
output: Predicted label ŷ ∈ {1, ..., K}.
1: return ŷ ∈ arg max_{k ∈ {1,...,K}} ĥ_k(x).
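A hedged Python rendering of the two boxes above, assuming the binary learner A is passed in as a function mapping (X, y ∈ {±1}) to a scoring function; the least-squares learner and toy data are stand-ins.

```python
import numpy as np

def one_against_all_train(X, y, K, binary_learner):
    """Train K scorers, class k vs. the rest (training box above)."""
    scorers = []
    for k in range(1, K + 1):
        y_k = np.where(np.asarray(y) == k, 1.0, -1.0)   # relabel: +1 iff y_i = k
        scorers.append(binary_learner(X, y_k))           # run A on S_k
    return scorers

def one_against_all_predict(scorers, X_new):
    """Predict argmax_k h_k(x), returning labels in {1, ..., K} (prediction box)."""
    scores = np.column_stack([h(X_new) for h in scorers])
    return 1 + np.argmax(scores, axis=1)

def least_squares_learner(X, y_pm):
    """Stand-in binary learner A: least squares on the +/-1 labels."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

# toy usage: three made-up clusters, one per class
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y = rng.integers(1, 4, size=300)
X = means[y - 1] + rng.normal(scale=0.5, size=(300, 2))
scorers = one_against_all_train(X, y, K=3, binary_learner=least_squares_learner)
print(one_against_all_predict(scorers, X[:5]), y[:5])
```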

When does one-against-all work well?

Main idea: Suppose every ĥ_k corresponds to a good estimate η̂_k of the conditional probability function, η̂_k(x) ≈ P(Y = k | X = x), on average. (E.g., η̂_k(x) = (1 + ĥ_k(x))/2 for all k = 1, ..., K.)

Then the behavior of the one-against-all classifier f̂ is similar to that of the Bayes optimal classifier!

Masking

Suppose X = R, Y = {1, 2, 3}, and P_{Y|X=x} is piece-wise constant. E.g.,
    P(Y = 1 | X = x) = 1{x ∈ [0.1, 0.9]},
    P(Y = 2 | X = x) = 1{x ∈ [1.1, 1.9]},
    P(Y = 3 | X = x) = 1{x ∈ [2.1, 2.9]}.

Try to fit P(Y = k | X = x) using affine functions of x, e.g., η̂_k(x) = m_k x + b_k for some m_k, b_k ∈ R.

Masking (continued)

[Figure: the conditional probabilities P(Y = k | X = x) and the best affine approximations η̂_k to them, for each k.]

The arg max_k η̂_k(x) is never 2, so label 2 is never predicted.

How to fix this?
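A small sketch reproducing the masking effect numerically under the piece-wise constant distribution from the previous slide; the marginal of X and the class priors are made up (chosen so the effect is clean), and the affine fits are ordinary least squares on the class indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
# made-up marginal: class priors 0.4 / 0.2 / 0.4, x uniform on the class's interval
y = rng.choice([1, 2, 3], size=n, p=[0.4, 0.2, 0.4])
lo = np.array([0.1, 1.1, 2.1])
x = lo[y - 1] + 0.8 * rng.random(n)

# best affine fit eta_hat_k(x) = m_k x + b_k to each indicator 1{y = k}
Xb = np.column_stack([x, np.ones(n)])
fits = [np.linalg.lstsq(Xb, (y == k).astype(float), rcond=None)[0] for k in (1, 2, 3)]

grid = np.linspace(0.1, 2.9, 500)
eta_hat = np.column_stack([m * grid + b for (m, b) in fits])
predicted = 1 + np.argmax(eta_hat, axis=1)
print(sorted(set(predicted.tolist())))   # [1, 3]: class 2 never attains the argmax
```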

Key takeaways

1. Scoring functions and thresholds; alternative performance criteria.
2. Eliciting conditional probabilities with loss functions.
3. Calibration property.
4. One-against-all approach to multiclass classification; masking.
