Classification Logistic Regression

1 Classification Logistic Regression Machine Learning CSE546 Kevin Jamieson University of Washington October 16, 2016

2 THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS

3 Weather prediction revisited [Figure: plot with a Temperature axis]

4 Reading Your Brain, Simple Example. Classes: Person, Animal. Pairwise classification accuracy: 85% [Mitchell et al.]

5 Binary Classification
Learn: $f : X \to Y$, with $X$ the features and $Y$ the target classes, $Y \in \{0, 1\}$.
Questions for this slide: what is the loss function, and what is the expected loss of $f$?
Suppose you know $P(Y \mid X)$ exactly, how should you classify? The resulting rule is the Bayes optimal classifier.

6 Binary Classification
Learn: $f : X \to Y$, with $X$ the features and $Y$ the target classes, $Y \in \{0, 1\}$.
Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$
Expected loss of $f$: $E_{XY}[\mathbf{1}\{f(X) \neq Y\}] = E_X\big[E_{Y \mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x]\big]$
$E_{Y \mid X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x] = \sum_i P(Y = i \mid X = x)\,\mathbf{1}\{f(x) \neq i\} = \sum_{i \neq f(x)} P(Y = i \mid X = x) = 1 - P(Y = f(x) \mid X = x)$
Suppose you know $P(Y \mid X)$ exactly, how should you classify?
Bayes optimal classifier: $f(x) = \arg\max_y P(Y = y \mid X = x)$
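As a small illustration of the derivation on this slide, the sketch below picks the label with the largest posterior and evaluates the conditional 0/1 risk $1 - P(Y = f(x) \mid X = x)$ for each possible prediction. The posterior values and the function names are hypothetical, chosen only for the example.

```python
def bayes_classifier(posterior):
    """Return the label y that maximizes P(Y = y | X = x)."""
    return max(posterior, key=posterior.get)


def conditional_risk(posterior, prediction):
    """Expected 0/1 loss at a fixed x: 1 - P(Y = prediction | X = x)."""
    return 1.0 - posterior[prediction]


# Hypothetical posterior P(Y = y | X = x) at a single input x.
posterior = {0: 0.3, 1: 0.7}

best_label = bayes_classifier(posterior)
risks = {label: conditional_risk(posterior, label) for label in posterior}
```

Predicting the most probable label gives the smallest entry of `risks`, which is the content of the argmax rule.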

7 Link Functions
Estimating $P(Y \mid X)$: why not use standard linear regression?
Combining regression and probability? We need a mapping from real values to $[0, 1]$.
A link function!

8 Logistic Regression
Learn $P(Y \mid X)$ directly. Assume a particular functional form for the link function.
Logistic function (or sigmoid): $\sigma(z) = \frac{1}{1 + \exp(-z)}$
Sigmoid applied to a linear function of the input features: $z = w_0 + \sum_k w_k X_k$
Features can be discrete or continuous!

9 Understanding the sigmoid [Figure: sigmoid curves for three parameter settings: $w_0 = -2, w_1 = -1$; $w_0 = 0, w_1 = -1$; $w_0 = 0, w_1 = \ldots$]

10 Sigmoid for binary classes
$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp(w_0 + w^T X)$, so the log-odds are $w_0 + w^T X$.

11 Sigmoid for binary classes
$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$
$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp\big(w_0 + \sum_k w_k X_k\big)$
$\log \frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = w_0 + \sum_k w_k X_k$
Linear Decision Rule!
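The log-odds identity can be checked numerically. The following is a minimal sketch, assuming a single feature; the helper names `prob_y1` and `log_odds` and the input value are hypothetical and not from the slides.

```python
import math


def prob_y1(x, w0, w):
    """P(Y = 1 | w, X = x) = exp(z) / (1 + exp(z)) with z = w0 + sum_k w_k x_k."""
    z = w0 + sum(wk * xk for wk, xk in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))  # same value as exp(z) / (1 + exp(z))


def log_odds(x, w0, w):
    """log[P(Y = 1 | w, x) / P(Y = 0 | w, x)]."""
    p1 = prob_y1(x, w0, w)
    return math.log(p1 / (1.0 - p1))


# Illustrative parameters and input (values chosen for the example).
w0, w = -2.0, [-1.0]
x = [0.5]

linear_score = w0 + w[0] * x[0]
# Linear decision rule: predict 1 exactly when the linear score is positive.
predicted_label = 1 if linear_score > 0 else 0
```

Because the log-odds equal the linear score, thresholding the probability at 1/2 is the same as thresholding `linear_score` at 0.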

12 Logistic Regression: a Linear classifier

13 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = -1 \mid x, w) = \frac{1}{1 + \exp(w^T x)}$
$P(Y = 1 \mid x, w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}$
This is equivalent to: $P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
So we can compute the maximum likelihood estimator: $\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)$

14 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$

15 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
Logistic Loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$ (MLE for Gaussian noise)
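The two losses can be compared on a single example. This is a sketch under the assumption that the logistic loss is evaluated as a function of the margin $y_i x_i^T w$; the branching on the sign of the margin is an implementation choice added here to avoid overflow in `exp`, not something stated on the slide.

```python
import math


def logistic_loss(margin):
    """log(1 + exp(-margin)) with margin = y_i * x_i^T w, evaluated without overflow."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite as -margin + log(1 + exp(margin)).
    return -margin + math.log1p(math.exp(margin))


def squared_loss(y, score):
    """(y_i - x_i^T w)^2 with score = x_i^T w."""
    return (y - score) ** 2
```

For a label $y = 1$ and a score of 3, `squared_loss(1, 3.0)` is 4 while `logistic_loss(3.0)` is close to zero: the squared error penalizes a confidently correct prediction, the logistic loss does not.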

16 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w J(w)$, with $J(w) = \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
What does $J(w)$ look like? Is it convex?
Recall: $f$ is convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$.

17 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w J(w)$, with $J(w) = \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
Good news: $J(w)$ is a convex function of $w$, so there are no local optima problems.
Bad news: there is no closed-form solution for the minimizer of $J(w)$.
Good news: convex functions are easy to optimize.

18 Linear Separability
$\arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
When is this loss small?

19 Large parameters → Overfitting
If the data is linearly separable, the weights go to infinity.
In general, this leads to overfitting.
Penalizing high weights can prevent overfitting.
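The claim about separable data can be seen numerically: once a direction separates the classes, scaling it up keeps lowering the logistic loss, so no finite minimizer exists. The toy data set and the separating direction below are hypothetical values chosen for the sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def total_logistic_loss(w, data):
    """Sum over the data of log(1 + exp(-y_i x_i^T w))."""
    return sum(math.log1p(math.exp(-y * dot(x, w))) for x, y in data)


# Hypothetical linearly separable data: (features, label in {-1, +1}).
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
direction = [1.0, 1.0]  # separates the classes: every margin below is positive

margins = [y * dot(x, direction) for x, y in data]
# Loss along the ray c * direction for growing c.
losses = [total_logistic_loss([c * d for d in direction], data) for c in (1, 2, 4, 8)]
```

Each entry of `losses` is smaller than the one before it, which is the sense in which the weights "go to infinity".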

20 Regularized Conditional Log Likelihood
Add a regularization penalty, e.g., $L_2$:
$\arg\min_{w, b} \sum_{i=1}^n \log\big(1 + \exp(-y_i (x_i^T w + b))\big) + \lambda \|w\|_2^2$
Be sure to not regularize the offset $b$!
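A minimal sketch of the regularized objective, assuming the same kind of hypothetical separable toy data as an illustration; the function name `regularized_objective`, the data, and the value $\lambda = 0.1$ are all assumptions made for the example. The penalty touches only `w`, never the offset `b`.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def regularized_objective(w, b, data, lam):
    """sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2."""
    data_term = sum(math.log1p(math.exp(-y * (dot(x, w) + b))) for x, y in data)
    penalty = lam * dot(w, w)  # only w is penalized; the offset b is left out
    return data_term + penalty


# Hypothetical linearly separable data: (features, label in {-1, +1}).
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
scales = (0.25, 0.5, 1.0, 2.0, 4.0)

# Objective along the ray w = (c, c), with and without the penalty.
unregularized = [regularized_objective([c, c], 0.0, data, lam=0.0) for c in scales]
regularized = [regularized_objective([c, c], 0.0, data, lam=0.1) for c in scales]
```

Without the penalty the objective keeps decreasing as the scale grows; with it, the smallest value along this ray occurs at a finite scale.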

21 Gradient Descent

22 Machine Learning Problems
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$ over $w$.
Each $\ell_i(w)$ is convex.

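23 Machine Learning Problems
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$ over $w$.
Each $\ell_i(w)$ is convex.
$f$ convex: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$
If $f$ is differentiable: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$
$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$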
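These three conditions can be spot-checked on scalar functions. The sketch below tests the chord and tangent inequalities for the one-dimensional logistic loss $f(z) = \log(1 + e^{-z})$ on a small grid, and the subgradient inequality for $|x|$ at $x = 0$. The grid, the tolerance, and the helper names are assumptions for illustration; a finite grid check is evidence, not a proof.

```python
import math


def f(z):
    """One-dimensional logistic loss log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))


def f_prime(z):
    """Derivative of f: -1 / (1 + exp(z))."""
    return -1.0 / (1.0 + math.exp(z))


TOL = 1e-12  # slack for floating-point rounding
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
weights = [0.0, 0.25, 0.5, 0.75, 1.0]

# Chord inequality: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).
chord_holds = all(
    f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + TOL
    for x in points for y in points for lam in weights
)

# Tangent inequality: f(y) >= f(x) + f'(x) * (y - x).
tangent_holds = all(
    f(y) >= f(x) + f_prime(x) * (y - x) - TOL
    for x in points for y in points
)


def is_subgradient_of_abs_at_zero(g, test_points):
    """Check |y| >= |0| + g * (y - 0) on the given points."""
    return all(abs(y) >= g * y for y in test_points)
```

At the kink of $|x|$, any slope between -1 and 1 passes `is_subgradient_of_abs_at_zero`, while a slope such as 1.5 fails, which shows why non-differentiable convex functions can have many subgradients at one point.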

24 Machine Learning Problems
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$ over $w$.
Each $\ell_i(w)$ is convex.
Logistic Loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$

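25 Least squares
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$ over $w$. Each $\ell_i(w)$ is convex.
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve: $\min_w \frac{1}{2}\|Xw - y\|_2^2$?
Setting the gradient to zero gives the normal equations $(X^T X) w = X^T y$, a linear system of the form $Ax = b$.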

26 Least squares
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$ over $w$. Each $\ell_i(w)$ is convex.
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$
How does software solve: $\min_w \frac{1}{2}\|Xw - y\|_2^2$? It's complicated (LAPACK, BLAS, MKL, ...):
Do you need high precision?
Is $X$ column/row sparse?
Is $\hat{w}_{LS}$ sparse?
Is $X^T X$ well-conditioned?
Can $X^T X$ fit in cache/memory?

27 Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \frac{1}{2} f''(x)\,\delta^2 + \ldots$
Gradient descent: initialize $x_0$ randomly, then repeat $x_{t+1} = x_t - \eta f'(x_t)$

28 Taylor Series Approximation
Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$
Gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$

29 Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$
$\nabla f(w) = X^T (Xw - y) = X^T X w - X^T y$
$w_{t+1} = w_t - \eta X^T (X w_t - y) = w_t - \eta X^T X w_t + \eta X^T y = (I - \eta X^T X)\, w_t + \eta X^T y$
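The update $w_{t+1} = w_t - \eta X^T (X w_t - y)$ translates directly into a loop. The following is a minimal sketch in plain Python; the small matrix `X`, the targets `y`, the step size, and the number of steps are hypothetical values picked so that the exact solution is $(1, 2)$ and the iteration converges.

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def least_squares_gradient(X, y, w):
    """Gradient of 0.5 * ||Xw - y||^2, namely X^T (Xw - y)."""
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    return matvec(transpose(X), residual)


def gradient_descent(X, y, eta, steps):
    w = [0.0] * len(X[0])  # start from the zero vector
    for _ in range(steps):
        g = least_squares_gradient(X, y, w)
        w = [wk - eta * gk for wk, gk in zip(w, g)]
    return w


# Hypothetical data with y = X @ [1, 2] exactly, so the minimizer is [1, 2].
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1.0, 4.0, 3.0]

w_gd = gradient_descent(X, y, eta=0.1, steps=500)
```

The step size matters: for this quadratic the iteration converges only when $\eta$ is below $2$ divided by the largest eigenvalue of $X^T X$, which holds for the values used here.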

30 At the minimizer $w_*$ we have $\nabla f(w_*) = 0$, so $\eta X^T X w_* = \eta X^T y$.

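31 Gradient Descent
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$, $w_{t+1} = w_t - \eta \nabla f(w_t)$
$(w_{t+1} - w_*) = (I - \eta X^T X)(w_t - w_*) = (I - \eta X^T X)^{t+1} (w_0 - w_*)$
Example: $X = [\ldots]$, $y = [\ldots]$, $w_0 = [0, 0]^T$, $w_* = [\ldots]$
When $X^T X$ is a diagonal matrix $D$, the power $(I - \eta D)^{t+1}$ is also diagonal, so each coordinate converges separately: $|w_{t+1,i} - w_{*,i}| = |1 - \eta D_{ii}|^{t+1}\, |w_{0,i} - w_{*,i}|$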
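The error recursion explains why gradient descent is slow on badly conditioned problems. The sketch below assumes $X^T X$ is diagonal; the diagonal entries, $w_*$, and the step size are hypothetical stand-ins and are not the values of the slide's example, which did not survive extraction.

```python
def gd_errors_diagonal(diagonal, w_star, eta, steps):
    """Run gradient descent from w_0 = 0 when X^T X = diag(diagonal).

    In that case the gradient of 0.5 * ||Xw - y||^2 is D (w - w_star),
    so each coordinate is updated independently.
    """
    w = [0.0] * len(diagonal)
    for _ in range(steps):
        w = [wi - eta * di * (wi - si) for wi, di, si in zip(w, diagonal, w_star)]
    return [abs(wi - si) for wi, si in zip(w, w_star)]


diagonal = [1e-6, 1.0]  # hypothetical, badly conditioned diagonal of X^T X
w_star = [1.0, 1.0]     # hypothetical minimizer
eta, steps = 1.0, 1000

errors = gd_errors_diagonal(diagonal, w_star, eta, steps)
# Closed form from the recursion: |1 - eta * D_ii|^steps * |w_0i - w_star_i|.
predicted = [abs(1.0 - eta * d) ** steps * abs(0.0 - s) for d, s in zip(diagonal, w_star)]
```

With these values the well-scaled coordinate is solved after one step, while the poorly scaled one has barely moved after a thousand steps; a single step size cannot suit both.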

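33 Taylor Series Approximation
Taylor series in one dimension: $f(x + \delta) = f(x) + f'(x)\,\delta + \frac{1}{2} f''(x)\,\delta^2 + \ldots$
Newton's method: minimize the quadratic approximation over $\delta$. Setting its derivative to zero, $f'(x) + f''(x)\,\delta = 0$, gives $\delta = -f'(x) / f''(x)$.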

34 Taylor Series Approximation
Taylor series in $d$ dimensions: $f(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v + \ldots$
Newton's method: $x_{t+1} = x_t + \eta v_t$ with $v_t = -[\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$

35 Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
What are $\nabla f(w)$ and $\nabla^2 f(w)$?
$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
$w_{t+1} = w_t + \eta v_t$

36 Newton's Method
$f(w) = \frac{1}{2}\|Xw - y\|_2^2$
$\nabla f(w) = X^T (Xw - y)$
$\nabla^2 f(w) = X^T X$
$v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$
$w_{t+1} = w_t + \eta v_t$
For quadratics, Newton's method converges in one step! (Not a surprise, why?) With $\eta = 1$:
$w_1 = w_0 - (X^T X)^{-1} X^T (X w_0 - y) = w_*$
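The one-step claim can be verified on a two-parameter problem. This is a sketch with $\eta = 1$; the data, the starting point, and the helper `solve_2x2` (Cramer's rule, adequate only for this tiny illustration) are assumptions made for the example and not how numerical libraries solve the system.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def gram(X):
    """Hessian of 0.5 * ||Xw - y||^2, namely X^T X."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]


def gradient(X, y, w):
    """X^T (Xw - y)."""
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    return [sum(row[i] * r for row, r in zip(X, residual)) for i in range(len(w))]


def solve_2x2(A, b):
    """Solve A v = b for a 2x2 matrix A using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton_step(X, y, w):
    g = gradient(X, y, w)
    v = solve_2x2(gram(X), [-gi for gi in g])  # Hessian * v = -gradient
    return [wi + vi for wi, vi in zip(w, v)]   # step size eta = 1


# Hypothetical data with y = X @ [1, 2] exactly, so the minimizer is [1, 2].
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1.0, 4.0, 3.0]
w0 = [5.0, -3.0]  # arbitrary starting point

w1 = newton_step(X, y, w0)
```

A single step lands on the minimizer because the second-order Taylor expansion of a quadratic is exact, so minimizing the approximation minimizes the function itself.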

37 General case
In general, how many iterations does Newton's method need to achieve $f(w_t) - f(w_*) \le \epsilon$?
So why are ML problems overwhelmingly solved by gradient methods?
Hint: $v_t$ is the solution to: $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$

38 General Convex case
Target accuracy: $f(w_t) - f(w_*) \le \epsilon$
Newton's method: $t \approx \log(\log(1/\epsilon))$
Gradient descent, under different assumptions on $f$:
$f$ is smooth and strongly convex: $aI \preceq \nabla^2 f(w) \preceq bI$
$f$ is smooth: $\nabla^2 f(w) \preceq bI$
$f$ is potentially non-differentiable: $\|\nabla f(w)\|_2 \le c$
Clean convergence proofs: Bubeck. See also Nocedal + Wright, Bubeck.
Other: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, ...

39 Revisiting Logistic Regression

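40 Loss function: Conditional Likelihood
Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$
$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$
$\hat{w}_{MLE} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$
$\nabla f(w) = -\sum_{i=1}^n \frac{y_i x_i}{1 + \exp(y_i x_i^T w)}$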
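Putting the pieces together, the logistic objective can be minimized with plain gradient descent. The sketch below differentiates $f(w) = \sum_i \log(1 + \exp(-y_i x_i^T w))$ term by term, which gives $-y_i x_i / (1 + \exp(y_i x_i^T w))$ per example. The toy data (deliberately not linearly separable, with a constant first feature acting as an offset), the step size, and the iteration count are hypothetical choices for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, data):
    """f(w) = sum_i log(1 + exp(-y_i x_i^T w))."""
    return sum(math.log1p(math.exp(-y * dot(x, w))) for x, y in data)


def gradient(w, data):
    """grad f(w) = -sum_i y_i x_i / (1 + exp(y_i x_i^T w))."""
    grad = [0.0] * len(w)
    for x, y in data:
        coeff = -y / (1.0 + math.exp(y * dot(x, w)))
        grad = [g + coeff * xk for g, xk in zip(grad, x)]
    return grad


def gradient_descent(data, eta, steps):
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        w = [wk - eta * gk for wk, gk in zip(w, gradient(w, data))]
    return w


# Hypothetical data that is NOT linearly separable, so a finite minimizer exists.
data = [([1.0, 0.5], 1), ([1.0, -1.0], -1), ([1.0, 2.0], 1),
        ([1.0, 1.5], -1), ([1.0, -0.5], 1), ([1.0, -2.0], -1)]

w_fit = gradient_descent(data, eta=0.1, steps=2000)
```

On separable data the same loop would keep increasing the norm of `w` instead of settling, which is the motivation for the regularized objective discussed earlier in the deck.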

Classification Logistic Regression, Machine Learning CSE546, Kevin Jamieson, University of Washington, October 16, 2016