Classification: Logistic Regression
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2016
THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS
Weather prediction revisited
[figure: temperature prediction]
Reading Your Brain, Simple Example
Pairwise classification accuracy: 85%
[figure: fMRI data, "Person" vs. "Animal" classes; Mitchell et al.]
Binary Classification
Learn: f: X → Y
  X: features
  Y: target classes, Y ∈ {0, 1}
Loss function: ℓ(f(x), y) = 1{f(x) ≠ y}
Expected loss of f:
  E_{XY}[1{f(X) ≠ Y}] = E_X[ E_{Y|X}[1{f(x) ≠ Y} | X = x] ]
  E_{Y|X}[1{f(x) ≠ Y} | X = x] = Σ_i P(Y = i | X = x) 1{f(x) ≠ i}
                               = Σ_{i ≠ f(x)} P(Y = i | X = x)
                               = 1 − P(Y = f(x) | X = x)
Suppose you know P(Y|X) exactly, how should you classify?
Bayes optimal classifier: f(x) = arg max_y P(Y = y | X = x)
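To make the Bayes rule concrete, here is a minimal numpy sketch (not from the slides) with a made-up discrete P(Y | X): picking arg max_y P(Y = y | X = x) at every x attains the smallest possible expected 0/1 loss.

```python
import numpy as np

# Toy sketch (hypothetical values): a discrete X with known P(Y=1 | X=x).
p_y1_given_x = np.array([0.9, 0.6, 0.3, 0.1])   # P(Y=1 | X=x) for x = 0,1,2,3
p_x = np.array([0.25, 0.25, 0.25, 0.25])        # marginal P(X=x)

# Bayes optimal classifier: predict the most probable class at each x.
f_bayes = (p_y1_given_x >= 0.5).astype(int)

def expected_loss(f):
    """Expected 0/1 loss: sum_x P(X=x) * P(Y != f(x) | X=x)."""
    p_error = np.where(f == 1, 1 - p_y1_given_x, p_y1_given_x)
    return np.sum(p_x * p_error)

print("Bayes risk:", expected_loss(f_bayes))        # averaged 1 - P(Y = f(x) | X=x)
print("Flipped rule:", expected_loss(1 - f_bayes))  # any other rule does no better
```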
Link Functions
Estimating P(Y|X): why not use standard linear regression?
Combining regression and probability?
  Need a mapping from the real-valued score w^T x (x ∈ R^d) to [0, 1]: a link function!
Logistic Regression
Learn P(Y|X) directly:
  Assume a particular functional form for the link function
  Sigmoid applied to a linear function of the input features
Logistic function (or sigmoid): σ(z) = 1 / (1 + exp(−z))
Features can be discrete or continuous!
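A minimal sketch of the sigmoid link in numpy; the weights w0, w and the feature vector x below are made-up values, just to show the linear score being squashed into (0, 1).

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) link: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Applied to a linear function of the features, as on the slide.
w0, w = -2.0, np.array([1.0, 0.5])   # hypothetical weights
x = np.array([3.0, -1.0])            # hypothetical feature vector
print(sigmoid(w0 + w @ x))           # an estimate of P(Y=1 | x), strictly in (0, 1)
```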
Understanding the sigmoid
[three plots of σ(w_0 + w_1 x): w_0 = −2, w_1 = −1;  w_0 = 0, w_1 = −1;  w_0 = 0, w_1 = −0.5]
Sigmoid for binary classes
P(Y = 0 | w, X) = 1 / (1 + exp(w_0 + Σ_k w_k X_k))
P(Y = 1 | w, X) = 1 − P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k) / (1 + exp(w_0 + Σ_k w_k X_k))
P(Y = 1 | w, X) / P(Y = 0 | w, X) = exp(w_0 + Σ_k w_k X_k)
log [ P(Y = 1 | w, X) / P(Y = 0 | w, X) ] = w_0 + Σ_k w_k X_k
Linear Decision Rule!
Logistic Regression: a Linear classifier
[plot of σ(w_0 + w^T x); the decision boundary is the hyperplane w_0 + w^T x = 0]
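As a small illustration of the linear decision rule, the hypothetical sketch below predicts Y = 1 exactly when the log-odds w_0 + w^T x is positive, i.e. on one side of the hyperplane w_0 + w^T x = 0; the weights and points are made up.

```python
import numpy as np

def predict_class(x, w0, w):
    """Logistic regression is a linear classifier: predict Y=1 exactly
    when the log-odds w0 + w^T x is positive, i.e. P(Y=1 | x) > 0.5."""
    return int(w0 + w @ x > 0)

w0, w = 0.5, np.array([2.0, -1.0])                  # hypothetical weights
print(predict_class(np.array([1.0, 1.0]), w0, w))   # 1: positive side of the hyperplane
print(predict_class(np.array([-1.0, 1.0]), w0, w))  # 0: negative side
```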
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
P(Y = −1 | x, w) = 1 / (1 + exp(w^T x))
P(Y = 1 | x, w) = exp(w^T x) / (1 + exp(w^T x))
This is equivalent to: P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
So we can compute the maximum likelihood estimator:
  ŵ_MLE = arg max_w ∏_{i=1}^n P(y_i | x_i, w)
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = arg max_w ∏_{i=1}^n P(y_i | x_i, w)
      = arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w))
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = arg max_w ∏_{i=1}^n P(y_i | x_i, w) = arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w))
Logistic Loss: ℓ_i(w) = log(1 + exp(−y_i x_i^T w))
Squared error Loss: ℓ_i(w) = (y_i − x_i^T w)²   (MLE for Gaussian noise)
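A quick numpy sketch comparing the two per-example losses with labels y ∈ {−1, +1}; the weight and feature vectors are arbitrary illustrative values.

```python
import numpy as np

# Per-example losses from the slide, with labels y in {-1, +1}.
def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (x @ w)))   # log(1 + exp(-y x^T w))

def squared_loss(w, x, y):
    return (y - x @ w) ** 2

w = np.array([1.0, -0.5])    # hypothetical weights
x = np.array([2.0, 1.0])     # hypothetical features
for y in (+1, -1):
    print(y, logistic_loss(w, x, y), squared_loss(w, x, y))
# A confidently correct prediction (large positive y * x^T w) drives the logistic
# loss toward 0, while the squared loss keeps penalizing scores far from +/-1.
```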
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
ŵ_MLE = arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w)) = arg min_w J(w)
What does J(w) look like? Is it convex?
Recall: f is convex if f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y) for all x, y and λ ∈ [0, 1].
Each term log(1 + exp(−y_i x_i^T w)) is a convex function of w, and a sum of convex functions is convex.
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
ŵ_MLE = arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w)) = arg min_w J(w)
Good news: J(w) is a convex function of w, so no local optima problems
Bad news: no closed-form solution for the minimizer of J(w)
Good news: convex functions are easy to optimize
Linear Separability
arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w))
When is this loss small?
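The loss is small when every margin y_i x_i^T w is large and positive. The sketch below (hypothetical toy data) evaluates the objective along increasing multiples of a separating direction: on linearly separable data the loss can be driven toward 0 just by scaling w up, which is why the unregularized minimizer runs off to infinity.

```python
import numpy as np

# Toy linearly separable data (hypothetical): y_i * x_i^T w_sep > 0 for all i.
X = np.array([[ 1.0,  2.0],
              [ 2.0,  1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w_sep = np.array([1.0, 1.0])

def logistic_objective(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

for scale in (1, 10, 100):
    print(scale, logistic_objective(scale * w_sep))
# The loss keeps shrinking as ||w|| grows, so on separable data the
# unregularized MLE does not exist (the weights diverge).
```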
Large parameters → Overfitting
If the data are linearly separable, the weights go to infinity
In general, large weights lead to overfitting
Penalizing large weights can prevent overfitting
Regularized Conditional Log Likelihood
Add a regularization penalty, e.g., L2:
  arg min_{w,b} Σ_{i=1}^n log(1 + exp(−y_i (x_i^T w + b))) + λ‖w‖₂²
Be sure not to regularize the offset b!
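A sketch of the regularized objective, assuming a penalty strength λ (called lam below) and leaving the offset b out of the penalty as the slide instructs; the data are made up just to exercise the function.

```python
import numpy as np

def regularized_objective(w, b, X, y, lam):
    """sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2,
    with the offset b deliberately left out of the penalty."""
    margins = y * (X @ w + b)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

# Hypothetical toy data and penalty strength.
X = np.array([[1.0, 2.0], [-1.0, -0.5], [2.0, 1.0]])
y = np.array([1, -1, 1])
print(regularized_objective(np.array([0.5, -0.2]), 0.1, X, y, lam=0.1))
```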
Gradient Descent
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2016
Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ R
Learning a model's parameters: minimize Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex.
Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ R
Learning a model's parameters: minimize Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex.
f convex: f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y)  ∀ x, y, λ ∈ [0, 1]
If f is differentiable: f(y) ≥ f(x) + ∇f(x)^T (y − x)  ∀ x, y
g is a subgradient at x if f(y) ≥ f(x) + g^T (y − x)
Machine Learning Problems
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ R
Learning a model's parameters: minimize Σ_{i=1}^n ℓ_i(w), where each ℓ_i(w) is convex.
Logistic Loss: ℓ_i(w) = log(1 + exp(−y_i x_i^T w))
Squared error Loss: ℓ_i(w) = (y_i − x_i^T w)²
Least squares
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ R
Squared error Loss: ℓ_i(w) = (y_i − x_i^T w)²
How does software solve:  min_w ½ ‖Xw − y‖₂²
One answer: the normal equations, (X^T X) w = X^T y
Least squares
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ R
Squared error Loss: ℓ_i(w) = (y_i − x_i^T w)²
How does software solve:  min_w ½ ‖Xw − y‖₂²
It's complicated (LAPACK, BLAS, MKL, ...):
  Do you need high precision?
  Is X column/row sparse?
  Is ŵ_LS sparse?
  Is X^T X well-conditioned?
  Can X^T X fit in cache/memory?
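As a small illustration of the "it's complicated" point, the sketch below solves the same least-squares problem two ways on synthetic data: via the normal equations and via numpy's LAPACK-backed lstsq. They agree here, but the preferred route in practice depends on the questions listed above (precision, sparsity, conditioning, memory).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Normal equations: (X^T X) w = X^T y (fine here, but squares the condition number).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library routine (LAPACK under the hood), typically an SVD/QR-based solve.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # same minimizer on this well-conditioned problem
```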
Taylor Series Approximation
Taylor series in one dimension:
  f(x + δ) = f(x) + f′(x) δ + ½ f″(x) δ² + ...
Gradient descent: initialize x_0 (e.g., randomly), then repeat x_{t+1} = x_t − η f′(x_t)
Taylor Series Approximation
Taylor series in d dimensions:
  f(x + v) = f(x) + ∇f(x)^T v + ½ v^T ∇²f(x) v + ...
Gradient descent: x_{t+1} = x_t − η ∇f(x_t)
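A generic gradient-descent loop matching the update x_{t+1} = x_t − η ∇f(x_t); the step size η, iteration count, and the toy quadratic used for the check are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, num_steps):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

# Minimal check on f(x) = ||x||^2 / 2, whose gradient is x and minimizer is 0.
print(gradient_descent(lambda x: x, x0=[4.0, -2.0], eta=0.1, num_steps=100))
```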
Gradient Descent
f(w) = ½ ‖Xw − y‖₂²    w_{t+1} = w_t − η ∇f(w_t)
∇f(w) = X^T (Xw − y), so the update is
  w_{t+1} = w_t − η X^T (X w_t − y)
With w* the minimizer (so X^T X w* = X^T y):
  w_{t+1} − w* = (w_t − w*) − η X^T X (w_t − w*) = (I − η X^T X)(w_t − w*)
Gradient Descent
f(w) = ½ ‖Xw − y‖₂²    w_{t+1} = w_t − η ∇f(w_t)
(w_{t+1} − w*) = (I − η X^T X)(w_t − w*) = (I − η X^T X)^{t+1} (w_0 − w*)
Example: X = [10³ 0; 0 1],  y = (10³, 1)^T,  w_0 = (0, 0)^T,  so w* = (1, 1)^T
Here X^T X = diag(10⁶, 1) is diagonal, so each coordinate converges independently:
  |w_{t+1} − w*|_k = |1 − η (X^T X)_{kk}|^{t+1} |w_0 − w*|_k
A step size small enough for the 10⁶ direction makes the other coordinate converge very slowly.
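Running gradient descent on the slide's example: with X = diag(10³, 1) the step size has to be small enough for the 10⁶ curvature direction, so the other coordinate barely moves even after thousands of iterations. The step size and iteration count below are illustrative choices.

```python
import numpy as np

X = np.diag([1e3, 1.0])
y = np.array([1e3, 1.0])
w_star = np.linalg.solve(X, y)           # [1, 1]

eta = 1.0 / 1e6                          # must satisfy eta < 2 / lambda_max(X^T X) = 2e-6
w = np.zeros(2)
for t in range(2000):
    w = w - eta * X.T @ (X @ w - y)      # gradient step for f(w) = 0.5 ||Xw - y||^2

print(w)                  # first coordinate is essentially 1, second has barely moved
print(np.abs(w - w_star))
```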
Taylor Series Approximation
Taylor series in one dimension:
  f(x + δ) = f(x) + f′(x) δ + ½ f″(x) δ² + ...
Newton's method: minimize the quadratic approximation, giving x_{t+1} = x_t − f′(x_t) / f″(x_t)
Taylor Series Approximation
Taylor series in d dimensions:
  f(x + v) = f(x) + ∇f(x)^T v + ½ v^T ∇²f(x) v + ...
Newton's method: x_{t+1} = x_t + v_t, where v_t = −[∇²f(x_t)]⁻¹ ∇f(x_t)
Newton's Method
f(w) = ½ ‖Xw − y‖₂²
∇f(w) = X^T (Xw − y)
∇²f(w) = X^T X
v_t is the solution to: ∇²f(w_t) v_t = −∇f(w_t)
w_{t+1} = w_t + v_t
For quadratics, Newton's method converges in one step! (Not a surprise, why?)
  w_1 = w_0 − (X^T X)⁻¹ X^T (X w_0 − y) = w*
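A small numerical check of the one-step claim on synthetic data: a single Newton step from any starting point lands exactly on the least-squares minimizer, because the quadratic Taylor model of a quadratic objective is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

def grad(w):
    return X.T @ (X @ w - y)

hessian = X.T @ X                         # constant Hessian for a quadratic objective

w0 = np.zeros(3)
v0 = np.linalg.solve(hessian, -grad(w0))  # Newton direction: H v = -grad
w1 = w0 + v0

w_star = np.linalg.solve(hessian, X.T @ y)
print(np.allclose(w1, w_star))            # True: one Newton step solves a quadratic exactly
```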
General case
In general, Newton's method achieves f(w_t) − f(w*) ≤ ε in roughly t ≈ log(log(1/ε)) iterations.
So why are ML problems overwhelmingly solved by gradient methods?
Hint: v_t is the solution to ∇²f(w_t) v_t = −∇f(w_t), a d × d linear system that must be formed and solved at every step.
General Convex case: iterations t needed to reach f(w_t) − f(w*) ≤ ε
Clean convergence proofs: Bubeck
Newton's method: t ≈ log(log(1/ε))
Gradient descent, depending on the assumptions on f:
  f is smooth and strongly convex: aI ⪯ ∇²f(w) ⪯ bI
  f is smooth: ∇²f(w) ⪯ bI
  f is potentially non-differentiable: ‖∇f(w)‖₂ ≤ c (bounded (sub)gradients)
  (see Nocedal + Wright, Bubeck for the corresponding rates)
Other methods: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, ...
Revisiting Logistic Regression
Machine Learning CSE546
Kevin Jamieson
University of Washington
October 16, 2016
Loss function: Conditional Likelihood
Have a bunch of iid data of the form: {(x_i, y_i)}_{i=1}^n,  x_i ∈ R^d,  y_i ∈ {−1, 1}
P(Y = y | x, w) = 1 / (1 + exp(−y w^T x))
ŵ_MLE = arg max_w ∏_{i=1}^n P(y_i | x_i, w) = arg min_w Σ_{i=1}^n log(1 + exp(−y_i x_i^T w)) = arg min_w f(w)
∇f(w) = ?
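The slide leaves ∇f(w) to be derived. Differentiating each term gives ∇f(w) = −Σ_i y_i x_i / (1 + exp(y_i x_i^T w)); the sketch below implements that expression and checks it against central finite differences on made-up data.

```python
import numpy as np

def objective(w, X, y):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    # d/dw sum_i log(1 + exp(-y_i x_i^T w)) = -sum_i y_i x_i / (1 + exp(y_i x_i^T w))
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ coef

# Finite-difference check on hypothetical data.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 4))
y = np.where(rng.standard_normal(20) > 0, 1, -1)
w = rng.standard_normal(4)

g = gradient(w, X, y)
eps = 1e-6
g_fd = np.array([(objective(w + eps * e, X, y) - objective(w - eps * e, X, y)) / (2 * eps)
                 for e in np.eye(4)])
print(np.allclose(g, g_fd, atol=1e-5))   # True: analytic gradient matches finite differences
```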