Classification Logistic Regression

Classification: Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington. October 16, 2016

THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS

Weather prediction revisited
[Figure: predicted temperature map]

Reading Your Brain, Simple Example
Pairwise classification accuracy: 85% (Person vs. Animal) [Mitchell et al.]

Binary Classification
Learn f: X -> Y, where X are the features and Y the target classes, Y in {0, 1}.
Loss function: l(f(x), y) = 1{f(x) != y}
Expected loss of f: E_{XY}[ 1{f(X) != Y} ]
Suppose you know P(Y | X) exactly; how should you classify? Bayes optimal classifier:

Binary Classification
Learn f: X -> Y, Y in {0, 1}.
Loss function: l(f(x), y) = 1{f(x) != y}
Expected loss of f:
  E_{XY}[ 1{f(X) != Y} ] = E_X[ E_{Y|X}[ 1{f(x) != Y} | X = x ] ]
  E_{Y|X}[ 1{f(x) != Y} | X = x ] = sum_i P(Y = i | X = x) 1{f(x) != i}
                                  = sum_{i != f(x)} P(Y = i | X = x)
                                  = 1 - P(Y = f(x) | X = x)
Suppose you know P(Y | X) exactly; how should you classify?
Bayes optimal classifier: f(x) = argmax_y P(Y = y | X = x)
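
As a quick illustration (not from the slides), here is a minimal sketch of the Bayes optimal rule, assuming we are handed the true conditional distribution as a hypothetical function `posterior(x)` that returns a vector of class probabilities:

```python
import numpy as np

def bayes_optimal_classifier(x, posterior):
    """Return argmax_y P(Y = y | X = x).

    `posterior(x)` is assumed to return an array p with p[y] = P(Y = y | X = x);
    in practice this distribution is unknown and must be estimated from data.
    """
    p = np.asarray(posterior(x))
    return int(np.argmax(p))

# Toy example: a made-up posterior that depends on a scalar x.
posterior = lambda x: np.array([0.7, 0.3]) if x < 0 else np.array([0.2, 0.8])
print(bayes_optimal_classifier(-1.0, posterior))  # -> 0
print(bayes_optimal_classifier(+1.0, posterior))  # -> 1
```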

Link Functions
Estimating P(Y | X): why not use standard linear regression?
Combining regression and probability? We need a function that maps a real-valued score (w^T x, for x in R^d) to [0, 1]: a link function!

Logistic Regression
Logistic function (or sigmoid): sigma(z) = 1 / (1 + exp(-z))
Learn P(Y | X) directly: assume a particular functional form for the link function, the sigmoid applied to a linear function z = w_0 + sum_k w_k X_k of the input features.
Features can be discrete or continuous!
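
A minimal sketch of the sigmoid link in NumPy; the overflow-safe form below is a common implementation trick, not something from the slides:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                      # exponent is always <= 0, so no overflow
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# P(Y = 1 | x, w) = sigma(w_0 + w^T x), with a made-up w and x
w0, w, x = -2.0, np.array([1.0, 0.5]), np.array([3.0, 2.0])
print(sigmoid(w0 + w @ x))                      # a probability in (0, 1)
print(sigmoid(np.array([-800.0, 0.0, 800.0])))  # stays finite: [0., 0.5, 1.]
```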

Understanding the sigmoid
[Figure: three plots of sigma(w_0 + w_1 x) over x in [-6, 6], for (w_0, w_1) = (-2, -1), (0, -1), and (0, -0.5).]

Sigmoid for binary classes
  P(Y = 0 | w, X) = 1 / (1 + exp(w_0 + sum_k w_k X_k))
  P(Y = 1 | w, X) = 1 - P(Y = 0 | w, X) = exp(w_0 + sum_k w_k X_k) / (1 + exp(w_0 + sum_k w_k X_k))
What is the ratio P(Y = 1 | w, X) / P(Y = 0 | w, X)?

Sigmoid for binary classes
  P(Y = 0 | w, X) = 1 / (1 + exp(w_0 + sum_k w_k X_k))
  P(Y = 1 | w, X) = 1 - P(Y = 0 | w, X) = exp(w_0 + sum_k w_k X_k) / (1 + exp(w_0 + sum_k w_k X_k))
  P(Y = 1 | w, X) / P(Y = 0 | w, X) = exp(w_0 + sum_k w_k X_k)
  log[ P(Y = 1 | w, X) / P(Y = 0 | w, X) ] = w_0 + sum_k w_k X_k
Linear Decision Rule!
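
A small numerical check (a sketch with made-up w and x) that the log-odds under the sigmoid model is exactly the linear score w_0 + sum_k w_k x_k, so the classifier predicts Y = 1 precisely when that score is positive:

```python
import numpy as np

w0, w = 0.5, np.array([2.0, -1.0])
x = np.array([0.3, 1.2])

z = w0 + w @ x                      # linear score
p1 = np.exp(z) / (1.0 + np.exp(z))  # P(Y = 1 | w, x)
p0 = 1.0 - p1                       # P(Y = 0 | w, x)

print(np.log(p1 / p0), z)           # log-odds equals the linear score
print(int(z > 0))                   # predict Y = 1 iff w_0 + w^T x > 0
```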

Logistic Regression: a Linear classifier
[Figure: sigmoid curve over x in [-6, 6]; the decision boundary is the set where w_0 + w^T x = 0, i.e., where P(Y = 1 | w, x) = 1/2.]

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = -1 | x, w) = 1 / (1 + exp(w^T x))
  P(Y = 1 | x, w) = exp(w^T x) / (1 + exp(w^T x))
This is equivalent to:
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
So we can compute the maximum likelihood estimator:
  w_MLE = argmax_w prod_{i=1}^n P(y_i | x_i, w)

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
  w_MLE = argmax_w prod_{i=1}^n P(y_i | x_i, w)
        = argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w))

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
  w_MLE = argmax_w prod_{i=1}^n P(y_i | x_i, w) = argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w))
Logistic loss: l_i(w) = log(1 + exp(-y_i x_i^T w))
Squared error loss: l_i(w) = (y_i - x_i^T w)^2  (the MLE for Gaussian noise)
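
For concreteness, a sketch of the two per-example losses from this slide; the function names are mine, and the logistic loss is written as a function of the margin y_i x_i^T w:

```python
import numpy as np

def logistic_loss(margin):
    """l_i(w) = log(1 + exp(-y_i x_i^T w)), as a function of the margin y_i x_i^T w."""
    return np.logaddexp(0.0, -margin)   # stable log(1 + exp(-margin))

def squared_loss(y, score):
    """l_i(w) = (y_i - x_i^T w)^2, as a function of the score x_i^T w."""
    return (y - score) ** 2

margins = np.array([-2.0, 0.0, 2.0, 10.0])
print(logistic_loss(margins))            # large when badly misclassified, near 0 when confidently correct
print(squared_loss(1.0, np.array([-1.0, 1.0, 3.0])))  # also penalizes scores far beyond the label
```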

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
  w_MLE = argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w)) = argmin_w J(w)
What does J(w) look like? Is it convex?
(Recall: f is convex if f(lambda x + (1-lambda) y) <= lambda f(x) + (1-lambda) f(y) for all x, y and lambda in [0, 1].)

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
  w_MLE = argmax_w prod_{i=1}^n P(y_i | x_i, w) = argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w)) = argmin_w J(w)
Good news: J(w) is a convex function of w, so there are no local-optima problems.
Bad news: there is no closed-form solution for the minimizer of J(w).
Good news: convex functions are easy to optimize.

Linear Separability
  argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w))
When is this loss small?

Large parameters => Overfitting
If the data is linearly separable, the weights go to infinity: each term log(1 + exp(-y_i x_i^T w)) approaches 0 only as y_i x_i^T w -> infinity.
In general, this leads to overfitting.
Penalizing large weights can prevent overfitting.
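
A tiny sketch (toy 1-D data, not from the slides) of why separable data drives the unregularized weights to infinity: scaling any separating direction up keeps strictly decreasing the logistic loss.

```python
import numpy as np

# Toy linearly separable data: the label is the sign of x.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def logistic_loss(w):
    # sum_i log(1 + exp(-y_i x_i^T w))
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

# w = [c] separates the data for any c > 0, and the loss keeps shrinking as c grows,
# so the unregularized minimizer has ||w|| -> infinity.
for c in [1.0, 10.0, 100.0]:
    print(c, logistic_loss(np.array([c])))
```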

Regularized Conditional Log Likelihood
Add a regularization penalty, e.g., L2:
  argmin_{w,b} sum_{i=1}^n log(1 + exp(-y_i (x_i^T w + b))) + lambda ||w||_2^2
Be sure not to regularize the offset b!
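
A sketch of the regularized objective above, keeping the offset b out of the penalty as the slide warns; the variable names (lam, etc.) and the example data are mine.

```python
import numpy as np

def regularized_logistic_objective(w, b, X, y, lam):
    """sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2, with b unpenalized."""
    margins = y * (X @ w + b)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

# Made-up example data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
print(regularized_logistic_objective(np.zeros(3), 0.0, X, y, lam=0.1))  # = 5 * log(2) at w = 0, b = 0
```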

Gradient Descent
Machine Learning CSE546, Kevin Jamieson, University of Washington. October 16, 2016

Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in R
Learning a model's parameters: minimize sum_{i=1}^n l_i(w), where each l_i(w) is convex.

Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in R
Learning a model's parameters: minimize sum_{i=1}^n l_i(w), where each l_i(w) is convex.
f convex:
  f(lambda x + (1-lambda) y) <= lambda f(x) + (1-lambda) f(y) for all x, y and lambda in [0, 1]
  (if f is differentiable) f(y) >= f(x) + grad f(x)^T (y - x) for all x, y
g is a subgradient of f at x if f(y) >= f(x) + g^T (y - x) for all y.

Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in R
Learning a model's parameters: minimize sum_{i=1}^n l_i(w), where each l_i(w) is convex.
Logistic loss: l_i(w) = log(1 + exp(-y_i x_i^T w))
Squared error loss: l_i(w) = (y_i - x_i^T w)^2

Least squares
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in R
Squared error loss: l_i(w) = (y_i - x_i^T w)^2
How does software solve: min_w (1/2) ||Xw - y||_2^2 ?
(Setting the gradient to zero gives the normal equations (X^T X) w = X^T y.)

Least squares
How does software solve min_w (1/2) ||Xw - y||_2^2 ?
It's complicated (LAPACK, BLAS, MKL, ...):
  Do you need high precision?
  Is X column/row sparse?
  Is w_LS sparse?
  Is X^T X well-conditioned?
  Can X^T X fit in cache/memory?
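
As an illustration of "it's complicated", a sketch contrasting a library solver with the naive normal equations on made-up data; np.linalg.lstsq calls an SVD-based LAPACK routine, while forming X^T X squares the condition number of X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=100)

# Library least-squares solver (LAPACK under the hood): the usual choice in practice.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations (X^T X) w = X^T y: fine here, but less stable for ill-conditioned X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_normal))
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))  # cond(X^T X) = cond(X)^2
```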

Taylor Series Approximation
Taylor series in one dimension: f(x + delta) = f(x) + f'(x) delta + (1/2) f''(x) delta^2 + ...
Gradient descent: initialize x_0, then repeat x_{t+1} = x_t - gamma f'(x_t) for a step size gamma > 0.

Taylor Series Approximation
Taylor series in d dimensions: f(x + v) = f(x) + grad f(x)^T v + (1/2) v^T hess f(x) v + ...
Gradient descent: x_{t+1} = x_t - gamma grad f(x_t)
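
A generic loop matching the update x_{t+1} = x_t - gamma grad f(x_t); this is only a sketch with a fixed step size and iteration budget, which the slides do not specify.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters):
    """Run x_{t+1} = x_t - step_size * grad_f(x_t) for num_iters iterations."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad_f(x)
    return x

# Example: minimize f(x) = (1/2) ||x||^2, whose gradient is x; the minimizer is 0.
print(gradient_descent(lambda x: x, x0=[3.0, -4.0], step_size=0.1, num_iters=100))
```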

Gradient Descent
  f(w) = (1/2) ||Xw - y||_2^2
  w_{t+1} = w_t - gamma grad f(w_t)
  grad f(w) = X^T (Xw - y), so
  w_{t+1} = w_t - gamma X^T (X w_t - y) = (I - gamma X^T X) w_t + gamma X^T y

Gradient Descent
  f(w) = (1/2) ||Xw - y||_2^2
  w_{t+1} = w_t - gamma grad f(w_t)
  (w_{t+1} - w*) = (I - gamma X^T X)(w_t - w*) = (I - gamma X^T X)^{t+1} (w_0 - w*)
Example: X = [[10^3, 0], [0, 1]], y = [10^3, 1]^T, w_0 = [0, 0]^T, w* = [1, 1]^T.
Here X^T X = diag(10^6, 1) is diagonal, so the two coordinates of the error shrink by the factors |1 - gamma 10^6| and |1 - gamma| per step: any step size small enough to keep the first coordinate stable makes the second coordinate converge very slowly.
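
Running the update on the slide's example (a sketch; the particular step size gamma = 1/10^6 is my choice, below the stability limit 2/10^6) shows the conditioning problem: the first coordinate reaches 1 immediately while the second barely moves after 1000 steps.

```python
import numpy as np

X = np.diag([1e3, 1.0])
y = np.array([1e3, 1.0])
w_star = np.array([1.0, 1.0])          # exact least-squares solution
gamma = 1.0 / 1e6                      # 10^6 is the largest eigenvalue of X^T X

w = np.zeros(2)
for t in range(1000):
    w = w - gamma * X.T @ (X @ w - y)  # w_{t+1} = w_t - gamma * grad f(w_t)

print(w)                               # first coordinate = 1, second coordinate still near 0
print(w_star - w)                      # remaining error, dominated by the poorly scaled coordinate
```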

Taylor Series Approximation
Taylor series in one dimension: f(x + delta) = f(x) + f'(x) delta + (1/2) f''(x) delta^2 + ...
Newton's method: x_{t+1} = x_t - f'(x_t) / f''(x_t)

Taylor Series Approximation
Taylor series in d dimensions: f(x + v) = f(x) + grad f(x)^T v + (1/2) v^T hess f(x) v + ...
Newton's method: x_{t+1} = x_t + v_t, where v_t = -[hess f(x_t)]^{-1} grad f(x_t)

Newton's Method
  f(w) = (1/2) ||Xw - y||_2^2
  grad f(w) = ?    hess f(w) = ?
  v_t is the solution to: hess f(w_t) v_t = -grad f(w_t)
  w_{t+1} = w_t + v_t

Newton's Method
  f(w) = (1/2) ||Xw - y||_2^2
  grad f(w) = X^T (Xw - y)
  hess f(w) = X^T X
  v_t is the solution to: hess f(w_t) v_t = -grad f(w_t)
  w_{t+1} = w_t + v_t
For quadratics, Newton's method converges in one step! (Not a surprise, why?)
  w_1 = w_0 - (X^T X)^{-1} X^T (X w_0 - y) = w*
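
A sketch verifying the claim numerically: one Newton step from an arbitrary starting point lands exactly on the least-squares solution of the quadratic (the data here is made up).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizer of (1/2)||Xw - y||^2

w0 = rng.normal(size=3)                          # arbitrary starting point
grad = X.T @ (X @ w0 - y)                        # grad f(w0)
hess = X.T @ X                                   # Hessian (constant for a quadratic)
v0 = np.linalg.solve(hess, -grad)                # Newton direction: hess v = -grad
w1 = w0 + v0

print(np.allclose(w1, w_star))                   # True: one step reaches the optimum
```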

General case
In general, how many iterations does Newton's method need to achieve f(w_t) - f(w*) <= epsilon? (See the rates on the next slide.)
So why are ML problems overwhelmingly solved by gradient methods?
Hint: v_t is the solution to hess f(w_t) v_t = -grad f(w_t), a d x d linear system that must be solved at every iteration.

General Convex case
Goal: f(w_t) - f(w*) <= epsilon. Clean convergence proofs: Bubeck.
Newton's method: t ~ log(log(1/epsilon)) iterations.
Gradient methods: the standard rates depend on the regularity of f:
  f smooth and strongly convex (aI <= hess f(w) <= bI): t ~ (b/a) log(1/epsilon)
  f smooth (hess f(w) <= bI): t ~ b/epsilon
  f potentially non-differentiable (||grad f(w)||_2 <= c): t ~ c^2/epsilon^2 (subgradient method)
References: Nocedal & Wright; Bubeck.
Other methods: BFGS, heavy ball, BCD, SVRG, ADAM, Adagrad, ...

Revisiting Logistic Regression
Machine Learning CSE546, Kevin Jamieson, University of Washington. October 16, 2016

Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n, x_i in R^d, y_i in {-1, 1}
  P(Y = y | x, w) = 1 / (1 + exp(-y w^T x))
  w_MLE = argmax_w prod_{i=1}^n P(y_i | x_i, w) = argmin_w sum_{i=1}^n log(1 + exp(-y_i x_i^T w)) = argmin_w f(w)
  grad f(w) = sum_{i=1}^n -y_i x_i / (1 + exp(y_i x_i^T w))
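
Putting the pieces together, a sketch of plain gradient descent on f(w) = sum_i log(1 + exp(-y_i x_i^T w)) using the gradient above; the data, step size, and iteration count are made up for illustration, and labels follow the slides' {-1, +1} convention.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i x_i^T w)): sum_i -y_i x_i / (1 + exp(y_i x_i^T w))."""
    margins = y * (X @ w)
    # 1 / (1 + exp(m)) = (1 - tanh(m/2)) / 2, computed without overflow for large margins
    coeffs = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))
    return X.T @ coeffs

# Made-up, roughly separable 2-D data with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=200))

w = np.zeros(2)
gamma = 0.01
for t in range(500):
    w = w - gamma * logistic_grad(w, X, y)   # w_{t+1} = w_t - gamma * grad f(w_t)

accuracy = np.mean(np.sign(X @ w) == y)
print(w, accuracy)
```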