Warm up: risk prediction with logistic regression


Warm up: risk prediction with logistic regression

Boss gives you a bunch of data on loans defaulting or not: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ {-1, 1}.

You model the data as:
P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

And compute the maximum likelihood estimator:
ŵ_MLE = arg max_w ∏_{i=1}^n P(y_i | x_i, w)

For a new loan application x, boss recommends giving the loan if your model says they will repay it with probability at least 0.95 (i.e. low risk):
Give loan to x if 1 / (1 + exp(-ŵ_MLE^T x)) ≥ 0.95

One year later only half of the loans are paid back and the bank folds. What might have happened?
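A minimal numpy sketch of this setup, under stated assumptions: the data are synthetic, the MLE is approximated by plain gradient descent on the negative log-likelihood, and all names are illustrative rather than from the slides.

```python
import numpy as np

# Fit w by (approximately) maximizing the conditional likelihood, then apply
# the "approve if P(repay) >= 0.95" rule from the slide. Synthetic data.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true)), 1, -1)

w = np.zeros(d)
for _ in range(500):                      # gradient descent on the average neg. log-likelihood
    p = 1 / (1 + np.exp(-y * (X @ w)))    # P(y_i | x_i, w) under the logistic model
    grad = -(X * ((1 - p) * y)[:, None]).mean(axis=0)
    w -= 0.5 * grad

x_new = rng.normal(size=d)
p_repay = 1 / (1 + np.exp(-w @ x_new))    # model's probability that y = +1
give_loan = p_repay >= 0.95
```

The punchline of the slide is that this decision rule is only as good as the model: if P(Y | x) is misspecified, the 0.95 "probabilities" need not be calibrated.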

Projects

Proposal due Thursday 10/25.

Guiding principles (for evaluation of the project):
- Keep asking yourself why something works or not. Dig deeper than just evaluating the method and reporting a test error.
- Must use real-world data available NOW
- Must report metrics
- Must reference papers and/or books

Study a real-world dataset:
- Evaluate multiple machine learning methods
- Why does one work better than another? Form a hypothesis and test the hypothesis with a subset of the real data or, if necessary, synthetic data

Study a method:
- Evaluate on multiple real-world datasets
- Why does the method work better on one dataset versus another? Form a hypothesis

Perceptron
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 23, 2018

Binary Classification

Learn: f : X → Y
- X – features
- Y – target classes, Y ∈ {-1, 1}

Loss function: ℓ(f(x), y) = 1{f(x) ≠ y}

Expected loss of f:
E_XY[1{f(X) ≠ Y}] = E_X[ E_{Y|X}[1{f(X) ≠ Y} | X = x] ]
where E_{Y|X}[1{f(X) ≠ Y} | X = x] = 1 - P(Y = f(x) | X = x)

Bayes optimal classifier: f(x) = arg max_y P(Y = y | X = x)

Binary Classification (continued)

Bayes optimal classifier: f(x) = arg max_y P(Y = y | X = x)

Model of logistic regression:
P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))

What if the model is wrong?
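To make the Bayes optimal classifier concrete, here is a tiny sketch on a hypothetical finite joint distribution (the numbers are made up):

```python
import numpy as np

# Toy joint distribution P(X, Y) over X in {0, 1, 2}, Y in {-1, +1}.
p_joint = {
    (0, -1): 0.10, (0, +1): 0.30,
    (1, -1): 0.25, (1, +1): 0.05,
    (2, -1): 0.10, (2, +1): 0.20,
}

def bayes_classifier(x):
    # f(x) = argmax_y P(Y = y | X = x); the normalizing P(X = x) cancels,
    # so comparing joint probabilities is enough.
    return max((-1, +1), key=lambda y: p_joint[(x, y)])

# Expected 0/1 loss of the Bayes classifier: the probability mass on the
# label it does NOT predict, summed over x.
bayes_error = sum(p for (x, y), p in p_joint.items() if y != bayes_classifier(x))
print({x: bayes_classifier(x) for x in (0, 1, 2)}, bayes_error)  # {0: 1, 1: -1, 2: 1} 0.25
```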

Binary Classification

Can we do classification without a model of P(Y = y | X = x)?

The Perceptron Algorithm [Rosenblatt '58, '62]

Classification setting: y ∈ {-1, +1}
Linear model; prediction: sign(w^T x + b)

Training:
- Initialize weight vector: w_0 = 0, b_0 = 0
- At each time step k:
  - Observe features: x_k
  - Make prediction: sign(x_k^T w_k + b_k)
  - Observe true class: y_k
  - Update model: if the prediction does not equal the truth,
    [w_{k+1}; b_{k+1}] = [w_k; b_k] + y_k [x_k; 1]
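A minimal numpy sketch of this training loop (the multi-pass wrapper and function name are illustrative, not from the slides):

```python
import numpy as np

# Perceptron update loop: predict, and on a mistake add y_k * x_k to w and y_k to b.
def perceptron(X, y, passes=10):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(passes):
        for k in range(n):
            y_hat = np.sign(X[k] @ w + b)   # make prediction (sign(0) counts as a mistake)
            if y_hat != y[k]:               # mistake: update the model
                w += y[k] * X[k]            # w_{k+1} = w_k + y_k x_k
                b += y[k]                   # b_{k+1} = b_k + y_k
    return w, b
```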

Rosenblatt 1957

"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
— The New York Times, 1958

Linear Separability

The perceptron is guaranteed to converge if the data are linearly separable, i.e., there exist w, b and a margin γ > 0 such that y_i (x_i^T w + b) ≥ γ for all i.

Perceptron Analysis: Linearly Separable Case

Theorem [Block, Novikoff]: Given a sequence of labeled examples (x_1, y_1), (x_2, y_2), ..., suppose:
- each feature vector has bounded norm: ||x_i||_2 ≤ R, and
- the dataset is linearly separable with margin γ: there exists w* with ||w*||_2 = 1 such that y_i (x_i^T w*) ≥ γ for all i.
Then the number of mistakes made by the online perceptron on any such sequence is bounded by (R/γ)^2.
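A standard way to see where the (R/γ)^2 bound comes from, sketched for the bias-free update w_{k+1} = w_k + y_k x_k (the bias can be folded into a constant feature); this is the usual textbook argument, not taken verbatim from the slides:

```latex
% Assume \|x_i\|_2 \le R and a unit vector w^* with y_i \langle w^*, x_i\rangle \ge \gamma.
% Start from w_1 = 0, update only on mistakes, and let M be the number of mistakes.
\begin{align*}
\langle w^*, w_{k+1}\rangle &= \langle w^*, w_k\rangle + y_k \langle w^*, x_k\rangle
  \;\ge\; \langle w^*, w_k\rangle + \gamma
  &&\Rightarrow\;\; \langle w^*, w_{M+1}\rangle \ge M\gamma,\\
\|w_{k+1}\|_2^2 &= \|w_k\|_2^2 + 2\,y_k \langle w_k, x_k\rangle + \|x_k\|_2^2
  \;\le\; \|w_k\|_2^2 + R^2
  &&\Rightarrow\;\; \|w_{M+1}\|_2^2 \le M R^2,
\end{align*}
% using y_k \langle w_k, x_k\rangle \le 0 on a mistake. By Cauchy--Schwarz,
% M\gamma \le \langle w^*, w_{M+1}\rangle \le \|w_{M+1}\|_2 \le \sqrt{M}\,R,
% hence M \le (R/\gamma)^2.
```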

Beyond the Linearly Separable Case

The perceptron algorithm is super cool!
- No assumption about the data distribution! It could be generated by an oblivious adversary; no need to be i.i.d.
- Makes a fixed number of mistakes, and it's done forever! Even if you see infinite data.

The perceptron is useless in practice!
- Real-world data are not linearly separable.
- If the data are not separable, it cycles forever and this is hard to detect.
- Even if the data are separable, it may not give good generalization accuracy (small margin).

What is the Perceptron Doing???

When we discussed logistic regression, we started from maximizing the conditional log-likelihood.
When we discussed the perceptron, we started from a description of an algorithm.

What is the perceptron optimizing?

Support Vector Machines
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 23, 2018

Linear classifiers

Which line is better?

Pick the one with the largest margin!

[Figure: two classes separated by the hyperplane x^T w + b = 0, with margin 2γ]

Pick the one with the largest margin!

Distance from x_0 to the hyperplane defined by x^T w + b = 0?

If x̃_0 is the projection of x_0 onto the hyperplane, then
||x_0 - x̃_0||_2 = (x_0 - x̃_0)^T (w / ||w||_2) = (1/||w||_2)(x_0^T w - x̃_0^T w) = (1/||w||_2)(x_0^T w + b),
using x̃_0^T w + b = 0 since x̃_0 lies on the hyperplane.
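A quick numeric sanity check of this formula (made-up numbers), comparing the closed form against an explicit projection:

```python
import numpy as np

# Distance from x0 to {x : x^T w + b = 0} equals (x0^T w + b) / ||w||_2.
w, b = np.array([3.0, 4.0]), -5.0
x0 = np.array([4.0, 6.0])

closed_form = (x0 @ w + b) / np.linalg.norm(w)       # signed distance
x0_proj = x0 - ((x0 @ w + b) / (w @ w)) * w          # explicit projection onto the hyperplane
print(closed_form, np.linalg.norm(x0 - x0_proj))     # both 6.2
```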

Pick the one with the largest margin!

Distance of x_0 from the hyperplane x^T w + b = 0: (1/||w||_2)(x_0^T w + b)

Optimal hyperplane (maximize the margin γ):
max_{w,b} γ   subject to   (1/||w||_2) y_i (x_i^T w + b) ≥ γ   ∀i

Optimal hyperplane (reparameterized):
min_{w,b} ||w||_2^2   subject to   y_i (x_i^T w + b) ≥ 1   ∀i

Pick the one with the largest margin!

Optimal hyperplane (reparameterized):
min_{w,b} ||w||_2^2   subject to   y_i (x_i^T w + b) ≥ 1   ∀i

Solve efficiently by many methods, e.g.:
- quadratic programming (QP): well-studied solution algorithms
- stochastic gradient descent
- coordinate descent (in the dual)
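As one concrete route, here is a sketch that hands the reparameterized problem to a generic convex solver (cvxpy) on synthetic data assumed to be linearly separable; the dataset and the choice of solver are illustrative, not prescribed by the slides:

```python
import cvxpy as cp
import numpy as np

# Hard-margin SVM as a QP: minimize ||w||^2 s.t. y_i (x_i^T w + b) >= 1 for all i.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])       # assumed separable

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]         # margin constraints
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value)
```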

What if the data is still not linearly separable?

If the data are linearly separable:
min_{w,b} ||w||_2^2   subject to   y_i (x_i^T w + b) ≥ 1   ∀i

If the data are not linearly separable, some points don't satisfy the margin constraint, so allow slack variables ξ_i:
min_{w,b,ξ} ||w||_2^2   subject to   y_i (x_i^T w + b) ≥ 1 - ξ_i   ∀i,   ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

What are support vectors?

SVM as a penalization method

Original quadratic program with linear constraints:
min_{w,b,ξ} ||w||_2^2   subject to   y_i (x_i^T w + b) ≥ 1 - ξ_i   ∀i,   ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

Using the same constrained convex optimization trick as for the lasso: for any ν ≥ 0 there exists a λ ≥ 0 such that the solution to the following problem is equivalent:
min_{w,b} Σ_{i=1}^n max{0, 1 - y_i (b + x_i^T w)} + λ ||w||_2^2

Machine Learning Problems

Have a bunch of i.i.d. data of the form: {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, y_i ∈ R

Learning a model's parameters: minimize Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex.

- Hinge loss: ℓ_i(w) = max{0, 1 - y_i x_i^T w}
- Logistic loss: ℓ_i(w) = log(1 + exp(-y_i x_i^T w))
- Squared error loss: ℓ_i(w) = (y_i - x_i^T w)^2

How do we solve for w? The last two lectures!
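For concreteness, the three per-example losses written out in numpy (the example point and weight vector are made up):

```python
import numpy as np

# The three per-example losses from the slide, for a single (x, y) pair.
def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * (x @ w))

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (x @ w)))

def squared_loss(w, x, y):
    return (y - x @ w) ** 2

w = np.array([1.0, -0.5])
x, y = np.array([2.0, 1.0]), 1.0
print(hinge_loss(w, x, y), logistic_loss(w, x, y), squared_loss(w, x, y))  # 0.0, ~0.20, 0.25
```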

Perceptron is optimizing what?

Perceptron update rule:
[w_{k+1}; b_{k+1}] = [w_k; b_k] + y_k [x_k; 1] · 1{y_k (b_k + x_k^T w_k) < 0}

SVM objective:
Σ_{i=1}^n max{0, 1 - y_i (b + x_i^T w)} + λ ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

∇_w ℓ_i(w, b) = -x_i y_i + (2λ/n) w   if y_i (b + x_i^T w) < 1,   and (2λ/n) w otherwise
∇_b ℓ_i(w, b) = -y_i   if y_i (b + x_i^T w) < 1,   and 0 otherwise

Perceptron is just SGD on the SVM objective with λ = 0 and step size η = 1!
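A sketch of SGD on this objective using the subgradients above. Note the perceptron's update condition uses threshold 0 where the hinge subgradient uses 1, so the sketch exposes the threshold as a parameter; the function name and defaults are illustrative:

```python
import numpy as np

# SGD on sum_i max{0, threshold - y_i (b + x_i^T w)} + lam ||w||^2, one example at a time.
# With lam=0, eta=1, threshold=0 each update is exactly the perceptron step.
def svm_sgd(X, y, lam=0.1, eta=0.01, epochs=20, threshold=1.0):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (b + X[i] @ w)
            grad_w = (2 * lam / n) * w          # gradient of the penalty term
            grad_b = 0.0
            if margin < threshold:               # hinge subgradient is active
                grad_w -= y[i] * X[i]
                grad_b -= y[i]
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Perceptron as a special case: svm_sgd(X, y, lam=0.0, eta=1.0, threshold=0.0)
```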

SVMs vs logistic regression

We often want probabilities/confidences; logistic regression wins here?

No! Perform isotonic regression or use a non-parametric bootstrap for probability calibration: the predictor gives some score, and we learn how to transform that score into a probability.

For classification loss, logistic regression and the SVM are comparable.

Multiclass setting:
- Softmax naturally generalizes logistic regression
- SVMs have …

What about good old least squares?
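One way this score-to-probability calibration step might look, sketched with scikit-learn's isotonic regression fit on held-out SVM scores (synthetic data; scikit-learn also packages this pattern as CalibratedClassifierCV):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data; labels in {0, 1} for the calibration targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = LinearSVC().fit(X_tr, y_tr)                      # gives scores, not probabilities
scores_cal = svm.decision_function(X_cal)
iso = IsotonicRegression(out_of_bounds="clip").fit(scores_cal, y_cal)

x_new = rng.normal(size=(1, 5))
p_hat = iso.predict(svm.decision_function(x_new))      # calibrated estimate of P(y = 1 | x)
```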

What about multiple classes?