Warm up: risk prediction with logistic regression

Boss gives you a bunch of data on loans defaulting or not: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$.

You model the data as:
$$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$$

And compute the maximum likelihood estimator:
$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)$$

For a new loan application $x$, boss recommends giving the loan if your model says it will be repaid with probability at least 0.95 (i.e., low risk):

Give loan to $x$ if $\dfrac{1}{1 + \exp(-\widehat{w}_{\mathrm{MLE}}^T x)} \geq 0.95$

One year later, only half of the loans are paid back and the bank folds. What might have happened?
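As a concrete illustration, here is a minimal sketch of this decision rule, assuming scikit-learn is available; the synthetic data and variable names are stand-ins and not part of the lecture.

```python
# Minimal sketch of the warm-up decision rule (synthetic data; not from the lecture).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # n x d loan features
y = np.where(X @ rng.normal(size=5) + 0.5 * rng.normal(size=1000) > 0, 1, -1)

# Approximate the (unregularized) MLE by using a very weak penalty.
model = LogisticRegression(C=1e6).fit(X, y)

# Boss's rule: grant the loan only if predicted repayment probability >= 0.95.
x_new = rng.normal(size=(1, 5))
p_repay = model.predict_proba(x_new)[0, 1]                       # column for class +1
give_loan = p_repay >= 0.95
```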
Projects

Proposal due Thursday 10/25.

Guiding principles (for evaluation of project):
- Keep asking yourself why something works or not. Dig deeper than just evaluating the method and reporting a test error.
- Must use real-world data available NOW
- Must report metrics
- Must reference papers and/or books

Study a real-world dataset:
- Evaluate multiple machine learning methods
- Why does one work better than another? Form a hypothesis and test the hypothesis with a subset of the real data or, if necessary, synthetic data

Study a method:
- Evaluate on multiple real-world datasets
- Why does the method work better on one dataset versus another? Form a hypothesis
Perceptron

Machine Learning - CSE546
Kevin Jamieson
University of Washington
October 23, 2018
Binary Classification

Learn: $f \colon X \to Y$, where $X$ are the features and $Y$ the target classes, $y \in \{-1, 1\}$.

Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$

Expected loss of $f$:
$$\mathbb{E}_{XY}[\mathbf{1}\{f(X) \neq Y\}] = \mathbb{E}_X\big[\mathbb{E}_{Y\mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x]\big]$$
$$\mathbb{E}_{Y\mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x] = 1 - P(Y = f(x) \mid X = x)$$

Bayes optimal classifier: $f(x) = \arg\max_y P(Y = y \mid X = x)$
Binary Classification

Learn: $f \colon X \to Y$, where $X$ are the features and $Y$ the target classes, $y \in \{-1, 1\}$.

Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$

Expected loss of $f$:
$$\mathbb{E}_{XY}[\mathbf{1}\{f(X) \neq Y\}] = \mathbb{E}_X\big[\mathbb{E}_{Y\mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x]\big]$$
$$\mathbb{E}_{Y\mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x] = 1 - P(Y = f(x) \mid X = x)$$

Bayes optimal classifier: $f(x) = \arg\max_y P(Y = y \mid X = x)$

Model of logistic regression: $P(Y = y \mid x, w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$

What if the model is wrong?
Binary Classification

Can we do classification without a model of $P(Y = y \mid X = x)$?
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: $y \in \{-1, +1\}$

Linear model
  Prediction:

Training:
  Initialize weight vector:
  At each time step:
    Observe features:
    Make prediction:
    Observe true class:
    Update model: if prediction is not equal to truth
The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: $y \in \{-1, +1\}$

Linear model
  Prediction: $\mathrm{sign}(w^T x + b)$

Training:
  Initialize weight vector: $w_0 = 0$, $b_0 = 0$
  At each time step:
    Observe features: $x_k$
    Make prediction: $\mathrm{sign}(x_k^T w_k + b_k)$
    Observe true class: $y_k$
    Update model: if prediction is not equal to truth,
    $$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix}$$
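A minimal sketch of the training loop above, assuming NumPy; the function and variable names are my own and not from the slides.

```python
# Perceptron training loop sketch: mistake-driven updates on labels in {-1, +1}.
import numpy as np

def perceptron(X, y, n_passes=10):
    """X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_passes):
        for k in range(n):
            y_hat = 1.0 if X[k] @ w + b >= 0 else -1.0   # sign(x_k^T w + b)
            if y_hat != y[k]:                             # mistake: update
                w += y[k] * X[k]
                b += y[k]
    return w, b
```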
Rosenblatt 1957

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
- The New York Times, 1958
Linear Separability

Perceptron guaranteed to converge if the data is linearly separable:
there exist $w$ with $\|w\|_2 = 1$ and $\gamma > 0$ such that $y_i (x_i^T w) \geq \gamma$ for all $i$.
Perceptron Analysis: Linearly Separable Case

Theorem [Block, Novikoff]:
Given a sequence of labeled examples $(x_1, y_1), (x_2, y_2), \dots$
Each feature vector has bounded norm: $\|x_i\|_2 \leq R$
If the dataset is linearly separable: there exists $w_*$ with $\|w_*\|_2 = 1$ and $y_i (x_i^T w_*) \geq \gamma$ for all $i$
Then the number of mistakes made by the online perceptron on any such sequence is bounded by $\left(\dfrac{R}{\gamma}\right)^2$.
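For reference, here is the standard proof sketch of this bound (not on the original slide); it absorbs $b$ into $w$ and uses the fact that the weights change only on mistakes.

```latex
% Standard proof sketch of the Block--Novikoff mistake bound (not from the slides).
% Let w_k denote the weights after the k-th mistake, with w_0 = 0.
\begin{align*}
  w_{k+1}^T w_* &= w_k^T w_* + y_k (x_k^T w_*) \ \ge\ w_k^T w_* + \gamma
      &&\Rightarrow\ w_M^T w_* \ge M\gamma \\
  \|w_{k+1}\|_2^2 &= \|w_k\|_2^2 + 2\, y_k (x_k^T w_k) + \|x_k\|_2^2
      \ \le\ \|w_k\|_2^2 + R^2
      &&\Rightarrow\ \|w_M\|_2^2 \le M R^2 \\
  % the cross term is nonpositive because an update only happens on a mistake
  M\gamma \ \le\ w_M^T w_* &\le \|w_M\|_2 \|w_*\|_2 = \|w_M\|_2 \le \sqrt{M}\, R
      &&\Rightarrow\ M \le \left(\frac{R}{\gamma}\right)^2
\end{align*}
```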
Beyond Linearly Separable Case

Perceptron algorithm is super cool!
- No assumption about data distribution!
  - Could be generated by an oblivious adversary, no need to be iid.
- Makes a fixed number of mistakes, and it's done forever!
  - Even if you see infinite data.
Beyond Linearly Separable Case

Perceptron algorithm is super cool!
- No assumption about data distribution!
  - Could be generated by an oblivious adversary, no need to be iid.
- Makes a fixed number of mistakes, and it's done forever!
  - Even if you see infinite data.

Perceptron is useless in practice!
- Real world is not linearly separable.
- If data is not separable, it cycles forever and this is hard to detect.
- Even if separable, may not give good generalization accuracy (small margin).
What is the Perceptron Doing???

When we discussed logistic regression:
- Started from maximizing the conditional log-likelihood

When we discussed the Perceptron:
- Started from a description of an algorithm

What is the Perceptron optimizing????
Support Vector Machines

Machine Learning - CSE546
Kevin Jamieson
University of Washington
October 23, 2018
Linear classifiers

Which line is better?
Pick the one with the largest margin!

[figure: separating hyperplane $x^T w + b = 0$ with margin $2\gamma$]
Pick the one with the largest margin!

[figure: hyperplane $x^T w + b = 0$, normal vector $w$, and a point $x_0$]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?
Pick the one with the largest margin!

[figure: hyperplane $x^T w + b = 0$, normal vector $w$, and a point $x_0$]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$$\|x_0 - \tilde{x}_0\|_2 = \frac{(x_0 - \tilde{x}_0)^T w}{\|w\|_2} = \frac{1}{\|w\|_2}\left(x_0^T w - \tilde{x}_0^T w\right) = \frac{1}{\|w\|_2}\left(x_0^T w + b\right)$$
since $x_0 - \tilde{x}_0$ is parallel to $w$ and $\tilde{x}_0^T w = -b$ (the projection lies on the hyperplane).
Pick the one with the largest margin!

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\dfrac{1}{\|w\|_2}\left(x_0^T w + b\right)$

Optimal Hyperplane:
$$\max_{w, b}\ \gamma \quad \text{subject to} \quad \frac{y_i (x_i^T w + b)}{\|w\|_2} \geq \gamma \ \ \forall i$$

[figure: hyperplane $x^T w + b = 0$ with margin $2\gamma$]
Pick the one with the largest margin!

Distance of $x_0$ from the hyperplane $x^T w + b = 0$: $\dfrac{1}{\|w\|_2}\left(x_0^T w + b\right)$

Optimal Hyperplane:
$$\max_{w, b}\ \gamma \quad \text{subject to} \quad \frac{y_i (x_i^T w + b)}{\|w\|_2} \geq \gamma \ \ \forall i$$

Optimal Hyperplane (reparameterized):
$$\min_{w, b}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 \ \ \forall i$$

[figure: hyperplane $x^T w + b = 0$ with margin $2\gamma$]
Pick the one with the largest margin!

Solve efficiently by many methods, e.g.:
- quadratic programming (QP): well-studied solution algorithms
- stochastic gradient descent
- coordinate descent (in the dual)

Optimal Hyperplane (reparameterized):
$$\min_{w, b}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 \ \ \forall i$$
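A minimal sketch of the reparameterized QP (not from the slides), assuming cvxpy is available; the data is synthetic and linearly separable by construction.

```python
# Hard-margin SVM as a quadratic program, sketched with cvxpy (synthetic data).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),    # positive class
               rng.normal(loc=-2.0, size=(20, 2))])   # negative class
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(cp.sum_squares(w))             # ||w||_2^2
constraints = [cp.multiply(y, X @ w + b) >= 1]         # y_i (x_i^T w + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

gamma = 1.0 / np.linalg.norm(w.value)                  # resulting geometric margin
```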
What if the data is still not linearly separable?

If data is linearly separable:
$$\min_{w, b}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 \ \ \forall i$$

[figure: hyperplane $x^T w + b = 0$ with margin width $1/\|w\|_2$ on each side]
What if the data is still not linearly separable?

If data is linearly separable:
$$\min_{w, b}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 \ \ \forall i$$

If data is not linearly separable, some points don't satisfy the margin constraint, so introduce slack variables $\xi_i$:
$$\min_{w, b, \xi}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 - \xi_i \ \ \forall i, \quad \xi_i \geq 0, \quad \sum_{j=1}^n \xi_j \leq \nu$$

[figure: hyperplane $x^T w + b = 0$ with margin width $1/\|w\|_2$ and points violating the margin]
What if the data is still not linearly separable?

If data is linearly separable:
$$\min_{w, b}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 \ \ \forall i$$

If data is not linearly separable, some points don't satisfy the margin constraint, so introduce slack variables $\xi_i$:
$$\min_{w, b, \xi}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 - \xi_i \ \ \forall i, \quad \xi_i \geq 0, \quad \sum_{j=1}^n \xi_j \leq \nu$$

What are support vectors?
SVM as penalization method

Original quadratic program with linear constraints:
$$\min_{w, b, \xi}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 - \xi_i \ \ \forall i, \quad \xi_i \geq 0, \quad \sum_{j=1}^n \xi_j \leq \nu$$
SVM as penalization method

Original quadratic program with linear constraints:
$$\min_{w, b, \xi}\ \|w\|_2^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \geq 1 - \xi_i \ \ \forall i, \quad \xi_i \geq 0, \quad \sum_{j=1}^n \xi_j \leq \nu$$

Using the same constrained convex optimization trick as for the lasso: for any $\nu \geq 0$ there exists a $\lambda \geq 0$ such that the solution to the following problem is equivalent:
$$\min_{w, b}\ \sum_{i=1}^n \max\{0,\, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2$$
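A minimal sketch of this penalized form (not from the slides), assuming scikit-learn is available; SGDClassifier with hinge loss and an L2 penalty minimizes essentially this objective, with alpha playing the role of $\lambda$ up to a scaling by $n$.

```python
# Penalized hinge-loss SVM via scikit-learn's SGDClassifier (synthetic data).
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)

# loss="hinge" + penalty="l2" corresponds to sum_i max{0, 1 - y_i(b + x_i^T w)} + lambda ||w||^2.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3).fit(X, y)
w_hat, b_hat = clf.coef_.ravel(), clf.intercept_[0]
```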
Machine Learning Problems

Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Hinge loss: $\ell_i(w) = \max\{0,\, 1 - y_i x_i^T w\}$
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How do we solve for $w$? The last two lectures!
Perceptron is optimizing what?

Perceptron update rule:
$$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$$

SVM objective:
$$\sum_{i=1}^n \max\{0,\, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$$

$\nabla_w \ell_i(w, b) = $ ?
Perceptron is optimizing what?

Perceptron update rule:
$$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k (b_k + x_k^T w_k) < 0\}$$

SVM objective:
$$\sum_{i=1}^n \max\{0,\, 1 - y_i (b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$$

$$\nabla_w \ell_i(w, b) = \begin{cases} -y_i x_i + \frac{2\lambda}{n} w & \text{if } y_i (b + x_i^T w) < 1 \\ \frac{2\lambda}{n} w & \text{otherwise} \end{cases}
\qquad
\nabla_b \ell_i(w, b) = \begin{cases} -y_i & \text{if } y_i (b + x_i^T w) < 1 \\ 0 & \text{otherwise} \end{cases}$$

Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
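A minimal from-scratch sketch of that SGD step (not from the slides), assuming NumPy; with lam = 0 and eta = 1 the update reduces to the perceptron's mistake-driven rule, up to the margin-1 versus margin-0 condition.

```python
# One SGD pass over the hinge-loss SVM objective; compare with the perceptron update.
import numpy as np

def sgd_hinge_svm(X, y, lam=1e-3, eta=1.0, n_passes=10):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_passes):
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:          # inside the margin or misclassified
                w -= eta * (-y[i] * X[i] + (2 * lam / n) * w)
                b -= eta * (-y[i])
            else:                                   # only the regularizer contributes
                w -= eta * (2 * lam / n) * w
    return w, b
```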
SVMs vs logistic regression

We often want probabilities/confidences; logistic wins here?
SVMs vs logistic regression

We often want probabilities/confidences; logistic wins here?

No! Perform isotonic regression or a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?
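A minimal sketch of that calibration step (not from the slides), assuming scikit-learn: an SVM's raw decision scores are mapped to probabilities with isotonic regression via cross-validated calibration; the data here is synthetic.

```python
# Calibrating SVM decision scores into probabilities with isotonic regression.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=500) > 0, 1, -1)

svm = LinearSVC()                                        # produces uncalibrated scores x^T w + b
calibrated = CalibratedClassifierCV(svm, method="isotonic", cv=5).fit(X, y)
p = calibrated.predict_proba(X[:5])                      # calibrated class probabilities
```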
SVMs vs logistic regression

We often want probabilities/confidences; logistic wins here?

No! Perform isotonic regression or a non-parametric bootstrap for probability calibration. The predictor gives some score; how do we transform that score into a probability?

For classification loss, logistic and SVM are comparable.

Multiclass setting:
- Softmax naturally generalizes logistic regression
- SVMs have

What about good old least squares?
What about multiple classes?