Classification objectives COMS 4771
1. Recap: binary classification
Scoring functions

Consider binary classification problems with Y = {−1, +1}.

WLOG, we can write a classifier as
    x ↦ +1 if h(x) > 0, and x ↦ −1 if h(x) ≤ 0,
for some scoring function h: X → R.

Some examples:
- For linear classifiers: h(x) = xᵀw.
- For the Bayes optimal classifier: h(x) = P(Y = +1 | X = x) − 1/2.

The expected zero-one loss is E[ ℓ_{0/1}(Y h(X)) ], where ℓ_{0/1}(z) = 1{z ≤ 0}.

[Plot: the zero-one loss ℓ_{0/1}(z) as a function of z.]
Zero-one loss and some surrogate losses

[Plot: ℓ_{0/1} together with the surrogate losses below (hinge, squared, logistic, exponential), as functions of z.]

Some surrogate losses that upper-bound ℓ_{0/1}:
- hinge: ℓ_hinge(z) = [1 − z]₊
- squared: ℓ_sq(z) = (1 − z)²
- logistic: ℓ_logistic(z) = log₂(1 + e^(−z))
- exponential: ℓ_exp(z) = e^(−z)

Note: when y ∈ {±1} and p ∈ R, ℓ_sq(yp) = (1 − yp)² = (y − p)².
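A small sketch (my own illustration, not from the slides) that evaluates these losses and checks numerically that each one upper-bounds the zero-one loss; the function names are just for this example.

```python
import numpy as np

def zero_one(z):      # l_0/1(z) = 1{z <= 0}
    return (z <= 0).astype(float)

def hinge(z):         # [1 - z]_+
    return np.maximum(0.0, 1.0 - z)

def squared(z):       # (1 - z)^2
    return (1.0 - z) ** 2

def logistic(z):      # log_2(1 + e^{-z})
    return np.log2(1.0 + np.exp(-z))

def exponential(z):   # e^{-z}
    return np.exp(-z)

z = np.linspace(-3, 3, 601)
for name, loss in [("hinge", hinge), ("squared", squared),
                   ("logistic", logistic), ("exponential", exponential)]:
    assert np.all(loss(z) >= zero_one(z) - 1e-12), name
print("all surrogates upper-bound the zero-one loss on [-3, 3]")
```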
Beyond binary classification

- What if different types of mistakes have different costs?
- What if we want the (conditional) probability of a positive label?
- What if there are more than two classes?
2. Cost-sensitive classification
Cost-sensitive classification

Often we have different costs for different kinds of mistakes. For c ∈ [0, 1]:

              ŷ = −1    ŷ = +1
    y = −1      0          c
    y = +1    1 − c        0

(Why can we restrict attention to c ∈ [0, 1]?)

Cost-sensitive ℓ-loss (for a loss ℓ: R → R):
    ℓ^(c)(y, p) = ( 1{y = +1}·(1 − c) + 1{y = −1}·c ) · ℓ(yp).
Taking ℓ = ℓ_{0/1} gives the cost-sensitive zero-one loss.

Fact: if ℓ is convex, then so is ℓ^(c)(y, ·) for each y ∈ {±1}.

Cost-sensitive (and empirical) ℓ-risk of a scoring function h: X → R:
    R^(c)(h) := E[ ℓ^(c)(Y, h(X)) ]                    (our actual objective)
    R̂^(c)(h) := (1/n) Σ_{i=1}^{n} ℓ^(c)(Yᵢ, h(Xᵢ))      (what we can try to minimize)
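A hedged sketch (not from the slides) of the empirical cost-sensitive ℓ-risk defined above; the base loss, scoring function, and data here are illustrative.

```python
import numpy as np

def cost_sensitive_loss(y, p, c, base_loss):
    """l^(c)(y, p) = (1{y = +1}(1 - c) + 1{y = -1} c) * base_loss(y * p)."""
    weight = np.where(y == 1, 1.0 - c, c)
    return weight * base_loss(y * p)

def empirical_cost_sensitive_risk(h, X, y, c, base_loss):
    """R_hat^(c)(h) = (1/n) * sum_i l^(c)(y_i, h(x_i))."""
    scores = np.array([h(x) for x in X])
    return float(np.mean(cost_sensitive_loss(y, scores, c, base_loss)))

# Toy usage with the hinge loss and a linear scoring function.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=100) > 0, 1, -1)
h = lambda x: x @ np.array([1.0, 0.0, 0.0])
hinge = lambda z: np.maximum(0.0, 1.0 - z)
print(empirical_cost_sensitive_risk(h, X, y, c=0.2, base_loss=hinge))
```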
Minimizer of cost-sensitive risk

What is the optimal classifier for the cost-sensitive (zero-one loss) risk?

Let η(x) := P(Y = +1 | X = x) for x ∈ X.

Conditional on X = x, the conditional cost-sensitive risk of a prediction ŷ is
    E[ ℓ^(c)(Y, ŷ) | X = x ] = η(x)·(1 − c)·1{ŷ = −1} + (1 − η(x))·c·1{ŷ = +1},
and its minimizer is
    ŷ := +1 if η(x)·(1 − c) > (1 − η(x))·c, i.e., if η(x) > c; and ŷ := −1 otherwise.

Therefore, the classifier based on the scoring function
    h(x) := η(x) − c,  x ∈ X,
has the smallest cost-sensitive risk R^(c).

But where does c come from?
Common performance criteria

- Precision: P(Y = +1 | h(X) > 0).
- Recall (a.k.a. True Positive Rate): P(h(X) > 0 | Y = +1). Same as 1 − False Negative Rate.
- False Positive Rate: P(h(X) > 0 | Y = −1). Same as 1 − True Negative Rate.
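A quick sketch (my own, not from the slides) computing empirical versions of these criteria for the classifier 1{h(x) > 0}; the variable names are illustrative.

```python
import numpy as np

def empirical_criteria(scores, y):
    """Empirical precision, recall (TPR), and FPR from scores h(x_i) and labels in {-1, +1}."""
    pred_pos = scores > 0
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == -1))
    fn = np.sum(~pred_pos & (y == 1))
    tn = np.sum(~pred_pos & (y == -1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)     # true positive rate
    fpr = fp / max(fp + tn, 1)        # false positive rate
    return precision, recall, fpr

scores = np.array([0.7, -0.2, 1.5, -0.9, 0.1])
y      = np.array([  1,    1,   1,   -1,  -1])
print(empirical_criteria(scores, y))
```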
Example: balanced error rate

Suppose you care about the Balanced Error Rate (BER):
    BER := (1/2)·False Negative Rate + (1/2)·False Positive Rate.
Which cost-sensitive risk should you (try to) minimize?

    2·BER = P(h(X) ≤ 0 | Y = +1)  +  P(h(X) > 0 | Y = −1)
                  [FNR]                     [FPR]
          = (1/π)·P(h(X) ≤ 0, Y = +1) + (1/(1 − π))·P(h(X) > 0, Y = −1),

where π := P(Y = +1). So use R^(c) with
    c := (1/(1 − π)) / ( 1/(1 − π) + 1/π ) = π = P(Y = +1)
(which you can estimate).
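A sketch (my own, not from the slides) of how one might put this to work: estimate π from the training labels and threshold an estimate of η at c = π̂. The conditional-probability estimate here is hypothetical.

```python
import numpy as np

def ber_classifier(eta_hat, y_train):
    """Return a classifier thresholding eta_hat at c = pi_hat = P_hat(Y = +1),
    which (approximately) targets the balanced error rate."""
    pi_hat = float(np.mean(y_train == 1))      # estimate pi = P(Y = +1)
    return lambda x: 1 if eta_hat(x) > pi_hat else -1

# Toy usage with an illustrative conditional-probability estimate.
y_train = np.array([1, -1, -1, -1, 1, -1, -1, -1])    # imbalanced: pi_hat = 0.25
eta_hat = lambda x: 1.0 / (1.0 + np.exp(-x))          # hypothetical estimate of eta
f = ber_classifier(eta_hat, y_train)
print(f(-2.0), f(0.5))    # -1 and +1: predicts +1 whenever eta_hat(x) > 0.25
```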
Even more general setting

Distribution over importance-weighted examples: every example comes with an example-specific weight,
    (X, Y, W) ~ P,
where W is the (non-negative) importance weight of the example.

The importance-weighted ℓ-risk of h is R(h) := E[ W · ℓ(Y h(X)) ].

(This comes up in boosting, online decision-making, etc.)

Estimate it using an importance-weighted sample (X₁, Y₁, W₁), ..., (Xₙ, Yₙ, Wₙ):
    R̂(h) := (1/n) Σ_{i=1}^{n} Wᵢ · ℓ(Yᵢ h(Xᵢ)).
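A minimal sketch (not from the slides) of the importance-weighted empirical ℓ-risk; the weights, loss, and scoring function are illustrative.

```python
import numpy as np

def iw_empirical_risk(h, X, y, w, loss):
    """Importance-weighted empirical risk: (1/n) * sum_i w_i * loss(y_i * h(x_i))."""
    scores = np.array([h(x) for x in X])
    return float(np.mean(w * loss(y * scores)))

# Toy usage: one example down-weighted, one up-weighted.
X = np.array([[0.5], [-1.0], [2.0], [0.1]])
y = np.array([1, -1, 1, -1])
w = np.array([1.0, 0.5, 2.0, 1.0])
h = lambda x: float(x[0])
hinge = lambda z: np.maximum(0.0, 1.0 - z)
print(iw_empirical_risk(h, X, y, w, hinge))
```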
Example: importance weights

Suppose you have a few ways of collecting training data:

Type 1: randomly draw (X, Y) ~ P. (E.g., all emails.)

Type 2 (using a filter φ: X → {0, 1}): draw (X, Y) ~ P; return (X, Y) if φ(X) = 1, repeat otherwise. (E.g., emails that made it past the spam filter.)
Can we estimate E_{(X,Y)~P}[ ℓ(Y h(X)) ] using only examples of Type 2?

Type 3 (using a probabilistic filter q: X → (0, 1)): draw (X, Y) ~ P; toss a coin with heads probability q(X); return (X, Y) if the outcome is heads, repeat otherwise. (E.g., emails that made it past the randomized filter.)
Can we estimate E_{(X,Y)~P}[ ℓ(Y h(X)) ] using only examples of Type 3?
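The slide leaves these as questions; for Type 3, one standard answer is to attach the importance weight 1/q(x) to each kept example and form a self-normalized weighted average. The sketch below is my own illustration of that idea, assuming q(x) is known for every retained example; the base distribution and filter are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_type3(n, q, sample_xy):
    """Collect n examples that survive the probabilistic filter q,
    attaching the importance weight w = 1/q(x) to each kept example."""
    kept = []
    while len(kept) < n:
        x, y = sample_xy()
        if rng.random() < q(x):               # keep with probability q(x)
            kept.append((x, y, 1.0 / q(x)))
    return kept

def estimated_risk(kept, h, loss):
    """Self-normalized importance-weighted estimate of E_{(X,Y)~P}[loss(Y h(X))]
    using only Type-3 examples: sum_i w_i * loss_i / sum_i w_i."""
    w = np.array([wi for _, _, wi in kept])
    l = np.array([loss(y * h(x)) for x, y, _ in kept])
    return float(np.sum(w * l) / np.sum(w))

# Toy usage: hypothetical base distribution P and filter q.
def sample_xy():
    x = rng.normal()
    y = 1 if x + 0.3 * rng.normal() > 0 else -1
    return x, y

q = lambda x: 0.2 + 0.6 * (x > 0)             # probabilistic filter in (0, 1)
data = draw_type3(500, q, sample_xy)
h = lambda x: x                                # illustrative scoring function
hinge = lambda z: max(0.0, 1.0 - z)
print(estimated_risk(data, h, hinge))          # estimates the Type-1 (unfiltered) risk
```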
3. Conditional probability estimation
Why conditional probabilities?

Having the conditional probability function η permits probabilistic reasoning.

Example: Suppose I have a classifier f: X → {−1, +1}. False positives cost c, and false negatives cost 1 − c. Given x, how do we estimate the expected cost of the prediction f(x)?

Expected cost:
    (1 − c) · P(Y = +1 | X = x)   if f(x) = −1,
    c · P(Y = −1 | X = x)         if f(x) = +1.

To estimate the expected cost, it suffices to estimate η(x) = P(Y = +1 | X = x).
Eliciting conditional probabilities

Can we estimate η by minimizing a particular (empirical) risk?

Squared loss: ℓ_sq(z) = (1 − z)².

    E[ ℓ_sq(Y h(X)) | X = x ] = η(x)·(1 − h(x))² + (1 − η(x))·(1 + h(x))².

Using calculus, we can see that this is minimized by h such that h(x) = 2η(x) − 1, i.e., η(x) = (1 + h(x))/2.

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) squared loss risk.
2. Construct the estimate η̂ via η̂(x) := (1 + ĥ(x))/2, x ∈ X.
(May want to clip the output to the range [0, 1].)
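A minimal numpy sketch of this recipe (my own illustration): fit a linear scoring function by least squares on ±1 labels (which minimizes the empirical squared loss, per the earlier note), then map scores to probability estimates and clip.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: eta(x) = sigmoid(2 * x1), labels in {-1, +1}.
n, d = 2000, 2
X = rng.normal(size=(n, d))
eta = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = np.where(rng.random(n) < eta, 1.0, -1.0)

# Step 1: (approximately) minimize the empirical squared-loss risk over linear h(x) = x^T w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: plug-in estimate eta_hat(x) = (1 + h(x)) / 2, clipped to [0, 1].
def eta_hat(x):
    return float(np.clip((1.0 + x @ w) / 2.0, 0.0, 1.0))

print(eta_hat(np.array([1.0, 0.0])), eta_hat(np.array([-1.0, 0.0])))
```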
Eliciting conditional probabilities (again)

Logistic loss: ℓ_logistic(z) = log₂(1 + e^(−z)).

    E[ ℓ_logistic(Y h(X)) | X = x ] = η(x)·log₂(1 + e^(−h(x))) + (1 − η(x))·log₂(1 + e^(h(x))).

This is minimized by h such that logistic(h(x)) = η(x), where logistic(z) = 1/(1 + e^(−z)).

Recipe using the plug-in principle:
1. Find a scoring function ĥ: X → R to (approximately) minimize the (empirical) logistic loss risk.
2. Construct the estimate η̂ via η̂(x) := logistic(ĥ(x)), x ∈ X.

This works for many other losses.
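A small numerical check (my own, not from the slides) that the conditional logistic risk is minimized where logistic(h(x)) = η(x), i.e., at h(x) = log(η/(1 − η)); the grid search below is just for illustration.

```python
import numpy as np

def conditional_logistic_risk(h, eta):
    """eta * log2(1 + e^{-h}) + (1 - eta) * log2(1 + e^{h})."""
    return eta * np.log2(1 + np.exp(-h)) + (1 - eta) * np.log2(1 + np.exp(h))

h_grid = np.linspace(-6, 6, 100001)
for eta in [0.25, 0.5, 0.8]:
    h_star = h_grid[np.argmin(conditional_logistic_risk(h_grid, eta))]
    print(eta, h_star, np.log(eta / (1 - eta)))   # h_star is close to logit(eta)
```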
Deficiency of hinge loss

But not the hinge loss: ℓ_hinge(z) = max{0, 1 − z}.

E[ ℓ_hinge(Y h(X)) | X = x ] is minimized by h such that h(x) = sign(2η(x) − 1).

[Plot: conditional hinge loss risk as a function of h(x), for η(x) = 1/4, 1/3, 2/3, 3/4.]

Cannot recover η(x) from the minimizing h(x).
Difficulty of estimating conditional probabilities

Accurately estimating conditional probabilities is harder than learning a good classifier.

Example: In general,
    min_{w ∈ R^d} E[ ℓ_sq(Y Xᵀw) ]  ≠  min_{h: R^d → R} E[ ℓ_sq(Y h(X)) ].
The optimal scoring function x ↦ 2η(x) − 1 is not necessarily well-approximated by a linear function. (We'll see an example of this later.)

Common remedies: use more flexible models (e.g., polynomial functions).
Probabilistic calibration

Say η̂ is (probabilistically) calibrated if
    P(Y = +1 | η̂(X) = p) = p  for all p ∈ [0, 1].

E.g., it rains on half of the days on which you say "50% chance of rain."

Note: this is achieved by a constant function, η̂(x) = P(Y = +1) for all x ∈ X. So calibration is weaker than approximating the conditional probability function.

Useful for interpretability.

To (approximately) achieve calibration, we can post-process the output of the scoring function h.
Achieving calibration via post-processing

Use (validation) data ((xᵢ, yᵢ))_{i=1}^{n} from X × {±1} to learn g: R → [0, 1] so that η̂ := g ∘ h is (approximately) calibrated.

Option #1: Use a flexible model like Nadaraya-Watson regression to minimize a suitable loss (e.g., squared loss). E.g.,
    g(z) := Σ_{i=1}^{n} 1{yᵢ = +1} K((z − h(xᵢ))/σ) / Σ_{i=1}^{n} K((z − h(xᵢ))/σ),
where K is a smoothing kernel, like K(δ) = exp(−δ²/2), and σ > 0 is the bandwidth.

Option #2: Isotonic regression, which finds a monotone increasing g.

[Plot: an isotonic fit g(h(xᵢ)) against h(x), overlaid on the indicators 1{yᵢ = +1}.]

And many others.
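A sketch of Option #1 (Nadaraya-Watson smoothing of validation labels against validation scores): the Gaussian kernel and bandwidth follow the slide, while the bandwidth value, data, and function names below are illustrative.

```python
import numpy as np

def fit_nw_calibrator(scores_val, y_val, sigma=0.5):
    """Return g(z) = sum_i 1{y_i = +1} K((z - h(x_i))/sigma) / sum_i K((z - h(x_i))/sigma),
    with K(delta) = exp(-delta^2 / 2)."""
    scores_val = np.asarray(scores_val, dtype=float)
    pos = (np.asarray(y_val) == 1).astype(float)

    def g(z):
        k = np.exp(-((z - scores_val) / sigma) ** 2 / 2.0)
        return float(np.sum(pos * k) / np.sum(k))

    return g

# Usage: eta_hat = g composed with h on new points.
g = fit_nw_calibrator(scores_val=[-2.0, -0.5, 0.3, 1.4, 2.2],
                      y_val=[-1, -1, 1, 1, 1])
print(g(0.0), g(2.0))
```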
4. Multiclass classification
Multiclass classification

Often we have more than two classes: Y = {1, ..., K} (multiclass classification).

Logistic regression generalizes to multinomial logistic regression, where
    P(Y = k | X = x) = exp(xᵀβ_k) / Σ_{k'=1}^{K} exp(xᵀβ_{k'}),  x ∈ R^d.
MLE here is similar to MLE in the binary case.

Perceptron, SVM: many extensions for multiclass.

Neural networks: often formulated like multinomial logistic regression.

Can we generically reduce multiclass classification to binary classification?
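A minimal sketch (not from the slides) of the multinomial logistic (softmax) probabilities above; the parameter matrix shape and values are illustrative.

```python
import numpy as np

def softmax_probs(x, B):
    """P(Y = k | X = x) = exp(x^T beta_k) / sum_{k'} exp(x^T beta_{k'}),
    with B a (d, K) matrix whose columns are beta_1, ..., beta_K."""
    scores = x @ B                 # shape (K,)
    scores = scores - scores.max() # numerical stability; probabilities unchanged
    p = np.exp(scores)
    return p / p.sum()

x = np.array([1.0, -0.5])
B = np.array([[ 1.0,  0.0, -1.0],
              [ 0.5, -0.5,  0.0]])   # d = 2, K = 3
print(softmax_probs(x, B))            # sums to 1
```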
One-against-all (of the rest) reduction

One-against-all training
input: Labeled examples ((xᵢ, yᵢ))_{i=1}^{n} from X × {1, ..., K}; a learning algorithm A that takes data from X × {±1} and returns a scoring function h: X → R.
output: Scoring functions ĥ₁, ..., ĥ_K: X → R.
1: for k = 1, ..., K do
2:   Create the data set S_k = ((xᵢ, yᵢ^(k)))_{i=1}^{n}, where yᵢ^(k) = +1 if yᵢ = k, and yᵢ^(k) = −1 otherwise.
3:   Run A on S_k to obtain a scoring function ĥ_k: X → R.
4: end for
5: return Scoring functions ĥ₁, ..., ĥ_K.

One-against-all prediction
input: Scoring functions ĥ₁, ..., ĥ_K: X → R; a new point x ∈ X.
output: Predicted label ŷ ∈ {1, ..., K}.
1: return ŷ ∈ arg max_{k ∈ {1,...,K}} ĥ_k(x).
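A direct translation of the one-against-all pseudocode into Python (my own sketch); the binary learner A can be anything returning a scoring function, and the least-squares learner below is a hypothetical choice for concreteness.

```python
import numpy as np

def ova_train(X, y, K, learn):
    """Run the binary learner `learn` on K one-vs-rest relabelings of the data."""
    scorers = []
    for k in range(1, K + 1):
        y_k = np.where(y == k, 1.0, -1.0)     # +1 if y_i = k, -1 otherwise
        scorers.append(learn(X, y_k))
    return scorers

def ova_predict(scorers, x):
    """Predict the class whose scoring function is largest at x."""
    return 1 + int(np.argmax([h(x) for h in scorers]))

# Hypothetical binary learner A: least-squares linear scoring function with intercept.
def least_squares_learner(X, y_pm):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda x: float(np.append(x, 1.0) @ w)

# Usage: scorers = ova_train(X, y, K=3, learn=least_squares_learner); ova_predict(scorers, x_new)
```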
When does one-against-all work well?

Main idea: Suppose every ĥ_k corresponds to a good estimate η̂_k of the conditional probability function, i.e., η̂_k(x) ≈ P(Y = k | X = x) on average. (E.g., η̂_k(x) = (1 + ĥ_k(x))/2 for all k = 1, ..., K.)

Then the behavior of the one-against-all classifier f̂ is similar to that of the Bayes optimal classifier!
Masking

Suppose X = R, Y = {1, 2, 3}, and P_{Y|X=x} is piece-wise constant.

[Plot: the three conditional probabilities P(Y = k | X = x) as functions of x.]

E.g., P(Y = 1 | X = x) = 1{x ∈ [0.1, 0.9]}, P(Y = 2 | X = x) = 1{x ∈ [1.1, 1.9]}, P(Y = 3 | X = x) = 1{x ∈ [2.1, 2.9]}.

Try to fit P(Y = k | X = x) using affine functions of x, e.g., η̂_k(x) = m_k·x + b_k for some m_k, b_k ∈ R.
Masking (continued)

[Plot: the conditional probabilities P(Y = k | X = x) together with the best affine approximations to P(Y = k | X = x) for each k.]

The arg max_k η̂_k(x) is never 2, so this approach never predicts label 2.

How can we fix this?
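A numeric illustration of masking (my own reconstruction of the slide's example): fit the best affine approximation to each indicator P(Y = k | X = x) by least squares on a grid over [0, 3] and check which labels the arg max ever picks.

```python
import numpy as np

x = np.linspace(0.0, 3.0, 3000)
intervals = [(0.1, 0.9), (1.1, 1.9), (2.1, 2.9)]
P = np.stack([((a <= x) & (x <= b)).astype(float) for a, b in intervals])   # (3, n)

# Best affine approximation m_k * x + b_k to each P(Y = k | X = x) by least squares.
A = np.stack([x, np.ones_like(x)], axis=1)                 # design matrix for affine fits
coef = [np.linalg.lstsq(A, P[k], rcond=None)[0] for k in range(3)]
eta_hat = np.stack([A @ c for c in coef])                  # (3, n) fitted affine curves

# On this grid, label 2 is masked: the middle class's flat fit never exceeds both
# of the sloped fits, so the arg max only takes values 1 and 3.
print(np.unique(np.argmax(eta_hat, axis=0)) + 1)
```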
Key takeaways

1. Scoring functions and thresholds; alternative performance criteria.
2. Eliciting conditional probabilities with loss functions.
3. The calibration property.
4. The one-against-all approach to multiclass classification; masking.