Classification: Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS

Weather prediction revisited
[Figure: temperature prediction]

Reading Your Brain, Simple Example
Pairwise classification accuracy: 85% (Person vs. Animal) [Mitchell et al.]

Binary Classification
Learn $f: X \to Y$, where $X$ are the features and $Y$ is the target class, $Y \in \{0, 1\}$.
Loss function:
Expected loss of $f$:
Suppose you know $P(Y \mid X)$ exactly. How should you classify?
Bayes optimal classifier:

Binary Classification
Learn $f: X \to Y$, where $X$ are the features and $Y$ is the target class, $Y \in \{0, 1\}$.
Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$
Expected loss of $f$: $\mathbb{E}_{XY}[\mathbf{1}\{f(X) \neq Y\}] = \mathbb{E}_X\big[\mathbb{E}_{Y \mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x]\big]$
$\mathbb{E}_{Y \mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x] = \sum_i P(Y = i \mid X = x)\,\mathbf{1}\{f(x) \neq i\} = \sum_{i \neq f(x)} P(Y = i \mid X = x) = 1 - P(Y = f(x) \mid X = x)$
Suppose you know $P(Y \mid X)$ exactly. How should you classify?
Bayes optimal classifier: $f(x) = \arg\max_y P(Y = y \mid X = x)$
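
As a concrete illustration of the expected-loss calculation, here is a minimal Python sketch; the conditional distribution `p_y1_given_x` and marginal `p_x` are made up for the example. The Bayes optimal classifier predicts the more probable class at each $x$, and its expected 0/1 loss averages $1 - P(Y = f(x) \mid X = x)$ over $X$.

```python
# Hypothetical conditional distribution P(Y = 1 | X = x) over a few discrete x values.
p_y1_given_x = {"x_a": 0.9, "x_b": 0.3, "x_c": 0.55}
p_x = {"x_a": 0.5, "x_b": 0.3, "x_c": 0.2}  # made-up marginal P(X = x)

def bayes_optimal(x):
    """Predict the class with the larger conditional probability."""
    return 1 if p_y1_given_x[x] >= 0.5 else 0

# Conditional error at x is 1 - P(Y = f(x) | X = x); the expected 0/1 loss
# averages this over the marginal of X.
expected_loss = sum(
    p_x[x] * (1.0 - max(p_y1_given_x[x], 1.0 - p_y1_given_x[x]))
    for x in p_x
)
print([bayes_optimal(x) for x in p_x], expected_loss)
```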

Link Functions
Estimating $P(Y \mid X)$: why not use standard linear regression?
Combining regression and probability? We need a mapping from real values to [0, 1]: a link function!

Logistic Regression
Logistic function (or sigmoid): $\sigma(z) = \frac{1}{1 + \exp(-z)}$
Learn $P(Y \mid X)$ directly: assume a particular functional form for the link function, a sigmoid applied to a linear function of the input features.
Features can be discrete or continuous!
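
A small sketch of the sigmoid as a link function. The piecewise evaluation below is just one common way to keep `exp` from overflowing; it is an implementation choice, not something prescribed by the slide.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])            # exp(z) is safe when z < 0
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid(np.array([-100.0, -1.0, 0.0, 1.0, 100.0])))  # maps R into (0, 1)
```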

Understanding the sigmoid
[Figure: three sigmoid curves on $[-6, 6]$ for parameter settings $w_0 = -2, w_1 = -1$; $w_0 = 0, w_1 = -1$; and $w_0 = 0, w_1 = -0.5$]

Sigmoid for binary classes
$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = $ ?

Sigmoid for binary classes
$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp\!\left(w_0 + \sum_k w_k X_k\right)$
$\log \frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = w_0 + \sum_k w_k X_k$
A linear decision rule!
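
A quick numerical check of the linear decision rule, with made-up weights `w0`, `w` and a made-up feature vector `x`: thresholding $P(Y = 1 \mid w, X)$ at 1/2 gives the same answer as checking the sign of $w_0 + \sum_k w_k X_k$, because the log-odds equal that linear function.

```python
import numpy as np

w0, w = -1.0, np.array([2.0, -3.0])          # made-up weights
x = np.array([1.5, 0.4])                      # made-up feature vector

z = w0 + w @ x                                # linear function of the features
p_y1 = np.exp(z) / (1.0 + np.exp(z))          # P(Y = 1 | w, X) from the slide

# Thresholding the probability at 1/2 is the same test as checking sign(z),
# because log( p / (1 - p) ) = z.
assert (p_y1 > 0.5) == (z > 0)
print(z, p_y1)
```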

Logistic Regression: a linear classifier
[Figure: sigmoid curve on $[-6, 6]$]

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$.
$P(Y = -1 \mid x, w) = \frac{1}{1 + \exp(w^T x)}$
$P(Y = 1 \mid x, w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}$
This is equivalent to: $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
So we can compute the maximum likelihood estimator: $\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)$

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, with $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$.
$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big)$

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, with $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$.
$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big)$
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$ (the MLE for Gaussian noise)
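
A sketch of evaluating the logistic-loss objective on synthetic data. `np.logaddexp(0, -m)` is one stable way to compute $\log(1 + e^{-m})$; the design matrix and data-generating weights below are arbitrary.

```python
import numpy as np

def logistic_loss(w, X, y):
    """J(w) = sum_i log(1 + exp(-y_i x_i^T w)) for labels y_i in {-1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed without overflow.
    return np.sum(np.logaddexp(0.0, -margins))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # made-up design matrix
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100))
print(logistic_loss(np.zeros(3), X, y))             # equals 100 * log(2) at w = 0
```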

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, with $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$.
$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big) = \arg\min_w J(w)$
What does $J(w)$ look like? Is it convex?

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, with $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$.
$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big) = \arg\min_w J(w)$
Good news: $J(w)$ is a convex function of $w$, so there are no local-optima problems.
Bad news: there is no closed-form solution that minimizes $J(w)$.
Good news: convex functions are easy to optimize.

Linear Separability
$\arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big)$
When is this loss small?

Large parameters lead to overfitting
If the data is linearly separable, the weights go to infinity.
In general, very large weights lead to overfitting.
Penalizing high weights can prevent overfitting.

Regularized Conditional Log Likelihood
Add a regularization penalty, e.g., $L_2$:
$\arg\min_{w, b} \sum_{i=1}^n \log\big(1 + \exp(-y_i (x_i^T w + b))\big) + \lambda \|w\|_2^2$
Be sure not to regularize the offset $b$!
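
A sketch of the regularized objective, with the penalty applied to $w$ only and the offset $b$ left unpenalized; the regularization weight `lam` and the synthetic data are placeholders.

```python
import numpy as np

def regularized_logistic_loss(w, b, X, y, lam):
    """sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2.

    The penalty is applied to w only, not to the offset b.
    """
    margins = y * (X @ w + b)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

# Usage with made-up data: larger lam pulls w toward zero but leaves b free.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50))
print(regularized_logistic_loss(np.array([1.0, -1.0]), 0.2, X, y, lam=0.1))
```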

Gradient Descent
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
$f$ convex: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \forall x, y,\ \lambda \in [0, 1]$
First-order condition: $f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y$
$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$
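
A numerical sanity check, on synthetic points, that the one-sample logistic loss satisfies the first-order convexity condition $f(v) \ge f(u) + \nabla f(u)^T (v - u)$; the sample $(x_i, y_i)$ and the random test points are made up.

```python
import numpy as np

def f(w, x, y):
    """One-sample logistic loss log(1 + exp(-y x^T w))."""
    return np.logaddexp(0.0, -y * (x @ w))

def grad_f(w, x, y):
    """Its gradient: -y * sigmoid(-y x^T w) * x."""
    s = 1.0 / (1.0 + np.exp(y * (x @ w)))
    return -y * s * x

rng = np.random.default_rng(2)
x_i, y_i = rng.normal(size=3), 1.0            # made-up sample
for _ in range(1000):
    u, v = rng.normal(size=3), rng.normal(size=3)
    # First-order condition for convexity: f(v) >= f(u) + grad_f(u)^T (v - u)
    assert f(v, x_i, y_i) >= f(u, x_i, y_i) + grad_f(u, x_i, y_i) @ (v - u) - 1e-9
print("first-order condition held on all sampled pairs")
```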

Machine Learning Problems
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

Least squares
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve $\min_w \frac{1}{2}\|Xw - y\|_2^2$?

Least squares
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve $\min_w \frac{1}{2}\|Xw - y\|_2^2$? It's complicated (LAPACK, BLAS, MKL, ...):
Do you need high precision? Is $X$ column/row sparse? Is $\widehat{w}_{\mathrm{LS}}$ sparse? Is $X^T X$ well-conditioned? Can $X^T X$ fit in cache/memory?
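
For small, dense problems the two standard routes look like this; the SVD-based `lstsq` path is the safer default, while the normal equations are cheaper but square the condition number of $X$. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                 # made-up, well-conditioned problem
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + 0.01 * rng.normal(size=200)

# SVD-based solver: robust even when X^T X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations: cheap for small d, but conditioning of X^T X can be poor.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal, atol=1e-8))
```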

Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \frac{1}{2} f''(x)\,\delta^2 + \ldots$
Gradient descent:

Taylor Series Approximation
Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$
Gradient descent:

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
$w_{t+1} = w_t - \eta\, \nabla f(w_t)$
$\nabla f(w) = $ ?

Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $\quad w_{t+1} = w_t - \eta\, \nabla f(w_t)$
$(w_{t+1} - w_\star) = (I - \eta X^T X)(w_t - w_\star) = (I - \eta X^T X)^{t+1}(w_0 - w_\star)$
Example: $X = \begin{bmatrix} 10^3 & 0 \\ 0 & 1 \end{bmatrix}$, $\ y = \begin{bmatrix} 10^3 \\ 1 \end{bmatrix}$, $\ w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\ w_\star = $ ?
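
Running gradient descent on exactly this example shows why conditioning matters. The step size $\eta = 1/\lambda_{\max}(X^T X)$ is a standard safe choice, not one prescribed by the slide.

```python
import numpy as np

X = np.diag([1e3, 1.0])          # the example from the slide
y = np.array([1e3, 1.0])         # so the minimizer satisfies X w = y
w = np.zeros(2)                  # w_0 = (0, 0)

L = np.linalg.eigvalsh(X.T @ X).max()    # largest eigenvalue = 1e6
eta = 1.0 / L                            # safe step size (my choice)

for t in range(1000):
    w = w - eta * X.T @ (X @ w - y)      # w_{t+1} = w_t - eta * grad f(w_t)

print(w)   # first coordinate is solved after one step; the second, with curvature 1,
           # contracts only by (1 - 1e-6) per step and is still far from 1.0
```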

Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \frac{1}{2} f''(x)\,\delta^2 + \ldots$
Newton's method:

Taylor Series Approximation
Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$
Newton's method:

Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
$\nabla f(w) = $ ?
$\nabla^2 f(w) = $ ?
$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
$w_{t+1} = w_t + v_t$

Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
$\nabla f(w) = X^T (Xw - y)$
$\nabla^2 f(w) = X^T X$
$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
$w_{t+1} = w_t + v_t$
For quadratics, Newton's method converges in one step! (Not a surprise; why?)
$w_1 = w_0 - (X^T X)^{-1} X^T (X w_0 - y) = w_\star$
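
A quick check of the one-step claim on a random least-squares instance (synthetic data): one Newton step from $w_0 = 0$ lands on the exact minimizer.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                   # made-up least-squares problem
y = rng.normal(size=50)
w0 = np.zeros(3)

grad = X.T @ (X @ w0 - y)                      # grad f(w_0)
hess = X.T @ X                                 # Hessian, constant for a quadratic
v0 = np.linalg.solve(hess, -grad)              # Newton direction
w1 = w0 + v0

w_star = np.linalg.solve(hess, X.T @ y)        # exact minimizer of the quadratic
print(np.allclose(w1, w_star))                 # True: one Newton step suffices
```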

General case
In general, for Newton's method to achieve $f(w_t) - f(w_\star) \le \epsilon$:
So why are ML problems overwhelmingly solved by gradient methods?
Hint: $v_t$ is the solution to $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$.

General Convex case
Goal: $f(w_t) - f(w_\star) \le \epsilon$
Clean convergence rates and nice proofs: Bubeck
Newton's method: $t \approx \log\log(1/\epsilon)$
Gradient descent:
$f$ is smooth and strongly convex ($aI \preceq \nabla^2 f(w) \preceq bI$):
$f$ is smooth ($\nabla^2 f(w) \preceq bI$):
$f$ is potentially non-differentiable ($\|\nabla f(w)\|_2 \le c$):
(Nocedal and Wright; Bubeck)
Other methods: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, ...

Revisiting Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington
October 16, 2016

Loss function: Conditional Likelihood
Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, with $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$.
$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big) = \arg\min_w f(w)$
$\nabla f(w) = $ ?
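
Putting the pieces together: the gradient works out to $\nabla f(w) = -\sum_i y_i\, \sigma(-y_i x_i^T w)\, x_i$, and a few plain gradient steps suffice to drive the loss down. This is a sketch on synthetic data; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of f(w) = sum_i log(1 + exp(-y_i x_i^T w)) for y_i in {-1, +1}:
    grad f(w) = -sum_i y_i * sigmoid(-y_i x_i^T w) * x_i."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))      # sigmoid(-y_i x_i^T w)
    return -(X.T @ (y * s))

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))                  # made-up data
y = np.sign(X @ np.array([1.5, -1.0, 0.5]) + 0.2 * rng.normal(size=200))

w, eta = np.zeros(3), 0.01                     # illustrative step size
for _ in range(500):
    w = w - eta * grad_logistic(w, X, y)       # plain gradient descent
print(w)                                       # roughly aligned with the generating weights
```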