Machine Learning: Logistic Regression. Lecture 04
1 Machine Learning: Logistic Regression. Razvan C. Bunescu, School of Electrical Engineering and Computer Science, bunescu@ohio.edu
2 Supervised Learning. Task = learn an unknown function t : X → T that maps input instances x ∈ X to output targets t(x) ∈ T. Classification: the output t(x) ∈ T is one of a finite set of discrete categories. Regression: the output t(x) ∈ T is continuous, or has a continuous component. The target function t(x) is known only through a noisy set of training examples: (x_1, t_1), (x_2, t_2), ..., (x_N, t_N).
3 Supervised Learning. Training: training examples (x_k, t_k) are fed to a learning algorithm, which produces a model h. Testing: test examples (x, t) are fed to the model h to measure generalization performance.
4 Parametric Approaches to Supervised Learning. Task = build a function h(x) such that: h matches t well on the training data => h is able to fit data that it has seen; h also matches t well on test data => h is able to generalize to unseen data. Task = choose h from a "nice" class of functions that depend on a vector of parameters w: h(x) ≡ h_w(x) ≡ h(w, x). What classes of functions are "nice"?
5 Neurons. The soma is the central part of the neuron, where the input signals are combined. Dendrites are cellular extensions where the majority of the input occurs. The axon is a fine, long projection that carries nerve signals to other neurons. Synapses are molecular structures between axon terminals and other neurons, where the communication takes place.
6 Neuron Models
7 Spiking/LIF Neuron Function
8 Neuron Models
9 McCulloch-Pitts Neuron Function. Inputs x_0, x_1, x_2, x_3 are combined as Σ_i w_i x_i and passed through an activation/output function f to produce h(x). Algebraic interpretation: the output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights. Weights w_i correspond to the synaptic weights (activating or inhibiting). Summation corresponds to combination of signals in the soma. It is often transformed through an activation/output function.
10 Activation Functions. Unit step (Perceptron): f(z) = 0 if z < 0, 1 if z ≥ 0. Logistic (Logistic Regression): f(z) = 1 / (1 + e^(−z)). Identity (Linear Regression): f(z) = z.
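A minimal numpy sketch of these three activation functions (illustrative, not from the slides):

    import numpy as np

    def unit_step(z):
        # Perceptron activation: 0 if z < 0, else 1
        return np.where(z < 0, 0.0, 1.0)

    def logistic(z):
        # Logistic Regression activation: 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def identity(z):
        # Linear Regression activation
        return z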
11 Linear Regression. Inputs x_1, x_2, x_3 are combined as Σ_i w_i x_i and passed through the identity activation f(z) = z, so h(x) = Σ_i w_i x_i = wᵀx. Polynomial curve fitting is linear regression: x = φ(x) = [1, x, x^2, ..., x^M]ᵀ, h(x) = wᵀx.
12 McCulloch-Pitts Neuron Function. Inputs x_0, x_1, x_2, x_3 are combined as Σ_i w_i x_i and passed through an activation/output function f to produce h(x). Algebraic interpretation: the output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights. Weights w_i correspond to the synaptic weights (activating or inhibiting). Summation corresponds to combination of signals in the soma. It is often transformed through a monotonic activation/output function.
13 Logistic Regression. Inputs x_0, x_1, x_2, x_3 are combined as Σ_i w_i x_i and passed through the logistic activation f(z) = 1 / (1 + exp(−z)), so h(x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx)). Training set is (x_1, t_1), (x_2, t_2), ..., (x_N, t_N), with x = [1, x_1, x_2, ..., x_k]ᵀ. Can be used for both classification and regression: Classification: T = {C_1, C_2} = {1, 0}. Regression: T = [0, 1] (i.e. output needs to be normalized).
14 Logistic Regression for Binary Classification. Model output can be interpreted as posterior class probabilities: p(C_1|x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx)), p(C_2|x) = 1 − σ(wᵀx) = exp(−wᵀx) / (1 + exp(−wᵀx)). How do we train a logistic regression model? What error/cost function to minimize?
15 Logistic Regression Learning. Learning = finding the "right" parameters w = [w_0, w_1, ..., w_k]. Find w that minimizes an error function E(w) which measures the misfit between h(x_n, w) and t_n. Expect that h(x, w) performing well on training examples x_n ⇒ h(x, w) will perform well on arbitrary test examples x ∈ X. Least squares error function? E(w) = (1/2) Σ_{n=1..N} {h(x_n, w) − t_n}^2. Differentiable => can use gradient descent. Non-convex => not guaranteed to find the global optimum.
16 Maximum Likelihood. Training set is D = {⟨x_n, t_n⟩ | t_n ∈ {0, 1}, n ∈ 1..N}. Let h_n = p(C_1|x_n) = p(t_n = 1|x_n) = σ(wᵀx_n). Maximum likelihood (ML) principle: find parameters w that maximize the likelihood of the labels. The likelihood function is p(t|w) = Π_{n=1..N} h_n^{t_n} (1 − h_n)^{1 − t_n}. The negative log-likelihood (cross-entropy) error function: E(w) = −ln p(t|x, w) = −Σ_{n=1..N} {t_n ln h_n + (1 − t_n) ln(1 − h_n)}.
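A minimal numpy sketch of this cross-entropy error, assuming examples are the rows of X and t is a 0/1 label vector (names are illustrative):

    import numpy as np

    def cross_entropy_error(w, X, t):
        # h_n = sigma(w^T x_n) for every example
        h = 1.0 / (1.0 + np.exp(-X.dot(w)))
        # E(w) = -sum_n { t_n ln h_n + (1 - t_n) ln(1 - h_n) }
        return -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))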
17 Maximum Likelihood Learning for Logistic Regression. The ML solution is: w_ML = argmax_w p(t|w) = argmin_w E(w), and E(w) is convex in w. The ML solution is given by ∇E(w) = 0. Cannot solve analytically => solve numerically with gradient-based methods: stochastic gradient descent, conjugate gradient, L-BFGS, etc. The gradient is (prove it): ∇E(w) = Σ_{n=1..N} (h_n − t_n) x_n.
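A hedged sketch of batch gradient descent using this gradient; the learning rate eta and iteration count are illustrative choices, not from the slides:

    import numpy as np

    def train_logistic(X, t, eta=0.1, iters=1000):
        # X: N x (k+1) design matrix (first column all ones), t: N labels in {0, 1}
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X.dot(w)))
            grad = X.T.dot(h - t)   # sum_n (h_n - t_n) x_n
            w -= eta * grad
        return w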
18 Regularized Logistic Regression. Use a Gaussian prior over the parameters w = [w_0, w_1, ..., w_M]ᵀ: p(w) = N(w|0, α^(−1) I) = (α / 2π)^((M+1)/2) exp(−(α/2) wᵀw). Bayes' theorem: p(w|t) = p(t|w) p(w) / p(t) ∝ p(t|w) p(w). MAP solution: w_MAP = argmax_w p(t|w) p(w).
19 Regularized Logistic Regression. MAP solution: w_MAP = argmax_w p(t|w) p(w) = argmin_w −ln p(t|w) p(w) = argmin_w −ln p(t|w) − ln p(w) = argmin_w E_D(w) − ln p(w) = argmin_w E_D(w) + (α/2) wᵀw = argmin_w E_D(w) + E_W(w), where E_D(w) = −Σ_{n=1..N} {t_n ln h_n + (1 − t_n) ln(1 − h_n)} is the data term and E_W(w) = (α/2) wᵀw is the regularization term.
20 Regularized Logistic Regression. MAP solution: w_MAP = argmin_w E_D(w) + E_W(w); E(w) is still convex in w. The solution is given by ∇E(w) = 0, with ∇E(w) = ∇E_D(w) + ∇E_W(w) = Σ_{n=1..N} (h_n − t_n) x_n + α w, where h_n = σ(wᵀx_n). Cannot solve analytically => solve numerically: stochastic gradient descent [PRML 3.1.3], Newton-Raphson iterative optimization [PRML 4.3.3], conjugate gradient, L-BFGS.
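The only change from the unregularized gradient is the added α w term; a minimal sketch (names illustrative):

    import numpy as np

    def map_gradient(w, X, t, alpha):
        # grad E(w) = sum_n (sigma(w^T x_n) - t_n) x_n + alpha * w
        h = 1.0 / (1.0 + np.exp(-X.dot(w)))
        return X.T.dot(h - t) + alpha * w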
21 Softmax Regression = Logistic Regression for Multiclass Classification. Multiclass classification: T = {C_1, C_2, ..., C_K} = {1, 2, ..., K}. Training set is (x_1, t_1), (x_2, t_2), ..., (x_N, t_N), with x = [1, x_1, x_2, ..., x_M]ᵀ and t_1, t_2, ..., t_N ∈ {1, 2, ..., K}. One weight vector per class [PRML 4.3.4]: p(C_k|x) = exp(w_kᵀx) / Σ_j exp(w_jᵀx).
22 Softmax Regression (K ≥ 2). Inference: C* = argmax_{C_k} p(C_k|x) = argmax_{C_k} exp(w_kᵀx) / Z(x) = argmax_{C_k} exp(w_kᵀx) = argmax_{C_k} w_kᵀx, where Z(x) = Σ_j exp(w_jᵀx) is a normalization constant. Training using: Maximum Likelihood (ML), or Maximum A Posteriori (MAP) with a Gaussian prior on w.
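Because Z(x) is the same for every class, inference reduces to an argmax over the linear scores; a minimal sketch, assuming the rows of W are the per-class weight vectors:

    import numpy as np

    def predict_class(W, x):
        # argmax_k exp(w_k^T x) / Z(x)  ==  argmax_k w_k^T x
        return np.argmax(W.dot(x))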
23 Softmax Regression. The negative log-likelihood error function is: E_D(w) = −(1/N) Σ_{n=1..N} ln p(t_n|x_n) = −(1/N) Σ_{n=1..N} ln (exp(w_{t_n}ᵀx_n) / Z(x_n)) = −(1/N) Σ_{n=1..N} Σ_{k=1..K} δ_k(t_n) ln (exp(w_kᵀx_n) / Z(x_n)), convex in w, where δ_k(t) = 1 if t = k and 0 if t ≠ k is the Kronecker delta function.
24 Softmax Regression. The ML solution is: w_ML = argmin_w E_D(w). The gradient is (prove it): ∇_{w_k} E_D(w) = −(1/N) Σ_{n=1..N} (δ_k(t_n) − p(C_k|x_n)) x_n = −(1/N) Σ_{n=1..N} (δ_k(t_n) − exp(w_kᵀx_n) / Z(x_n)) x_n, with ∇E_D(w) = [∇_{w_1} E_D(w), ∇_{w_2} E_D(w), ..., ∇_{w_K} E_D(w)].
25 Regularized Softmax Regression. The new cost function is: E(w) = E_D(w) + E_W(w) = −(1/N) Σ_{n=1..N} Σ_{k=1..K} δ_k(t_n) ln (exp(w_kᵀx_n) / Z(x_n)) + (α/2) Σ_{k=1..K} w_kᵀw_k. The new gradient is (prove it): ∇_{w_k} E(w) = −(1/N) Σ_{n=1..N} (δ_k(t_n) − p(C_k|x_n)) x_n + α w_k.
26 Softmax Regression. The ML solution is given by ∇E_D(w) = 0. Cannot solve analytically. Solve numerically, by plugging [cost, gradient] = [E_D(w), ∇E_D(w)] values into general convex solvers: L-BFGS; Newton methods; conjugate gradient; stochastic / minibatch gradient-based methods: gradient descent with / without momentum, AdaGrad, AdaDelta, RMSProp, Adam, ... A sketch of the plug-in pattern follows below.
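One hedged example of handing [cost, gradient] to a general solver, here scipy.optimize.minimize with L-BFGS-B and a toy objective standing in for [E_D, ∇E_D]:

    import numpy as np
    from scipy.optimize import minimize

    def cost_grad(theta):
        # Toy convex objective: cost = ||theta||^2, gradient = 2 * theta
        return np.sum(theta ** 2), 2 * theta

    # jac=True tells scipy that cost_grad returns (cost, gradient) as a pair
    result = minimize(cost_grad, np.ones(5), jac=True, method='L-BFGS-B',
                      options={'maxiter': 100})
    print(result.x)  # converges to the zero vector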
27 Implementation. Need to compute [cost, gradient]: cost = −(1/N) Σ_{n=1..N} Σ_{k=1..K} δ_k(t_n) ln p(C_k|x_n) + (α/2) Σ_{k=1..K} w_kᵀw_k, gradient_k = −(1/N) Σ_{n=1..N} (δ_k(t_n) − p(C_k|x_n)) x_n + α w_k. => need to compute, for k = 1, ..., K, the output p(C_k|x_n) = exp(w_kᵀx_n) / Σ_j exp(w_jᵀx_n). Overflow when the w_kᵀx_n are too large.
28 Implementation: Preventing Overflows. Subtract from each product w_kᵀx the maximum product: c = max_{1≤k≤K} w_kᵀx, so that p(C_k|x) = exp(w_kᵀx − c) / Σ_j exp(w_jᵀx − c).
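A minimal sketch of this numerically stable softmax (names illustrative):

    import numpy as np

    def stable_softmax(scores):
        # scores[k] = w_k^T x; subtracting the max leaves the probabilities unchanged
        c = np.amax(scores)
        e = np.exp(scores - c)
        return e / np.sum(e)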
29 Implementation: Gradient Checking. Want to minimize J(θ), where θ is a scalar. Mathematical definition of the derivative: d/dθ J(θ) = lim_{ε→0} (J(θ + ε) − J(θ − ε)) / (2ε). Numerical approximation of the derivative: d/dθ J(θ) ≈ (J(θ + ε) − J(θ − ε)) / (2ε), where ε is a small constant (e.g. 10^(−4)).
30 Implementation: Gradient Checking. If θ is a vector of parameters θ_i: Compute the numerical derivative with respect to each θ_i. Create a vector v that is ε in position i and 0 everywhere else (how do you do this without a for loop in numpy?). Compute G_num(θ_i) = (J(θ + v) − J(θ − v)) / (2ε). Aggregate all derivatives into the numerical gradient G_num(θ). Compare the numerical gradient G_num(θ) with the implementation of the gradient G_imp(θ): ‖G_num(θ) − G_imp(θ)‖ / ‖G_num(θ) + G_imp(θ)‖ ≤ 10^(−6).
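A hedged sketch of the vector gradient check; one possible answer to the no-for-loop question is that the perturbation vectors are the rows of ε times the identity matrix:

    import numpy as np

    def numerical_gradient(J, theta, eps=1e-4):
        # Rows of eps * I are the perturbation vectors v, built without an explicit loop
        V = eps * np.identity(theta.size)
        return np.array([(J(theta + v) - J(theta - v)) / (2 * eps) for v in V])

    def gradient_check(J, grad_impl, theta, tol=1e-6):
        # Relative difference test from the slide
        g_num = numerical_gradient(J, theta)
        g_imp = grad_impl(theta)
        return np.linalg.norm(g_num - g_imp) / np.linalg.norm(g_num + g_imp) <= tol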
31 Implementation: Vectorization of LR. Version 1: Compute the gradient component-wise. ∇E(w) = Σ_{n=1..N} (h_n − t_n) x_n. Assume example x_n is stored in column X[:,n] of the data matrix X.

    grad = np.zeros(K)
    for n in range(N):
        h = sigmoid(w.dot(X[:,n]))
        temp = h - t[n]
        for k in range(K):
            grad[k] = grad[k] + temp * X[k,n]

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

(Lecture 03)
32 Implementation: Vectorization of LR. Version 2: Compute the gradient, partially vectorized. ∇E(w) = Σ_{n=1..N} (h_n − t_n) x_n.

    grad = np.zeros(K)
    for n in range(N):
        grad = grad + (sigmoid(w.dot(X[:,n])) - t[n]) * X[:,n]

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

(Lecture 03)
33 Implementation: Vectorization of LR. Version 3: Compute the gradient, fully vectorized. ∇E(w) = Σ_{n=1..N} (h_n − t_n) x_n.

    grad = X.dot(sigmoid(w.dot(X)) - t)

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

(Lecture 03)
34 Vectorization of Softmax. Need to compute [cost, gradient]: cost = −(1/N) Σ_{n=1..N} Σ_{k=1..K} δ_k(t_n) ln p(C_k|x_n) + (α/2) Σ_{k=1..K} w_kᵀw_k, gradient_k = −(1/N) Σ_{n=1..N} (δ_k(t_n) − p(C_k|x_n)) x_n + α w_k. => compute the ground truth matrix G such that G[k,n] = δ_k(t_n):

    from scipy.sparse import coo_matrix
    groundTruth = coo_matrix((np.ones(N, dtype=np.uint8),
                              (labels, np.arange(N)))).toarray()
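For intuition, a tiny hedged example of this construction (the labels are illustrative; passing shape makes the K x N dimensions explicit):

    import numpy as np
    from scipy.sparse import coo_matrix

    labels = np.array([2, 0, 1])   # illustrative: t_1 = 2, t_2 = 0, t_3 = 1 (0-based)
    N, K = labels.size, 3
    G = coo_matrix((np.ones(N, dtype=np.uint8),
                    (labels, np.arange(N))), shape=(K, N)).toarray()
    # G[k, n] = 1 exactly when t_n = k, e.g. G[2, 0] == 1 here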
35 Vectorization of Softmax. Compute cost = −(1/N) Σ_{n=1..N} Σ_{k=1..K} δ_k(t_n) ln p(C_k|x_n) + (α/2) Σ_{k=1..K} w_kᵀw_k: Compute the matrix of products w_kᵀx_n. Compute the matrix of w_kᵀx_n − c_n. Compute the matrix of exp(w_kᵀx_n − c_n). Compute the matrix of ln p(C_k|x_n). Compute the log-likelihood.
36 Vectorization of Softmax. Compute grad_k = −(1/N) Σ_{n=1..N} (δ_k(t_n) − p(C_k|x_n)) x_n + α w_k, with Gradient = [grad_1 grad_2 ... grad_K]: Compute the matrix of p(C_k|x_n). Compute the matrix of the gradient of the data term. Compute the matrix of the gradient of the regularization term. A sketch combining these steps follows below.
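A hedged end-to-end sketch of the vectorized cost and gradient, assuming W is K x M (one weight vector per row), X is M x N (examples in columns), and G is the K x N ground truth matrix:

    import numpy as np

    def softmax_cost_grad(W, X, G, alpha):
        N = X.shape[1]
        S = W.dot(X)                     # S[k, n] = w_k^T x_n
        S = S - np.amax(S, axis=0)       # subtract column max c_n to prevent overflow
        P = np.exp(S)
        P = P / np.sum(P, axis=0)        # P[k, n] = p(C_k | x_n)
        cost = -np.sum(G * np.log(P)) / N + 0.5 * alpha * np.sum(W * W)
        grad = -(G - P).dot(X.T) / N + alpha * W
        return cost, grad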
37 Vectorization of Softmax. Useful numpy functions: np.dot, np.amax, np.argmax, np.exp, np.sum, np.log, np.mean.
38 import scipy. scipy.sparse.coo_matrix:

    groundTruth = coo_matrix((np.ones(numCases, dtype=np.uint8),
                              (labels, np.arange(numCases)))).toarray()

scipy.optimize: scipy.optimize.fmin_l_bfgs_b:

    theta, _, _ = fmin_l_bfgs_b(softmaxCost, theta,
                                args=(numClasses, inputSize, decay, images, labels),
                                maxiter=100, disp=1)

Also scipy.optimize.fmin_cg and scipy.optimize.minimize. (Lecture 03)
39 Multiclass Logistic Regression (K ≥ 2). (1) Train one weight vector per class [PRML Chapter 4.3.4]: p(C_k|x) = exp(w_kᵀx) / Σ_j exp(w_jᵀx). (2) More general approach: p(C_k|x) = exp(wᵀφ(x, C_k)) / Σ_j exp(wᵀφ(x, C_j)). Inference: C* = argmax_{C_k} p(C_k|x). (Lecture 07)
40 Logistic Regression (K ≥ 2). (2) Inference in the more general approach: C* = argmax_{C_k} p(C_k|x) = argmax_{C_k} exp(wᵀφ(x, C_k)) / Z(x) = argmax_{C_k} exp(wᵀφ(x, C_k)) = argmax_{C_k} wᵀφ(x, C_k), where Z(x) = Σ_j exp(wᵀφ(x, C_j)) is the partition function. Training using: Maximum Likelihood (ML), or Maximum A Posteriori (MAP) with a Gaussian prior on w. (Lecture 07)
41 Logistic Regression (K ≥ 2) with ML. The negative log-likelihood error function is: E_D(w) = −ln Π_{n=1..N} p(t_n|x_n) = −Σ_{n=1..N} ln (exp(wᵀφ(x_n, t_n)) / Z(x_n)), convex in w. The gradient is (prove it): ∇E_D(w) = [∂E_D/∂w_0, ∂E_D/∂w_1, ..., ∂E_D/∂w_M], with ∂E_D/∂w_i = −Σ_{n=1..N} φ_i(x_n, t_n) + Σ_{n=1..N} Σ_{k=1..K} p(C_k|x_n) φ_i(x_n, C_k). The ML solution is w_ML = argmin_w E_D(w). (Lecture 07)
42 Logistic Regression (K ≥ 2) with ML. Set ∇E_D(w) = 0 ⇒ the ML solution satisfies: Σ_{n=1..N} φ_i(x_n, t_n) = Σ_{n=1..N} Σ_{k=1..K} p(C_k|x_n) φ_i(x_n, C_k) ⇒ for every feature φ_i, the observed value on D should be the same as the expected value on D! Solve numerically: stochastic gradient descent [PRML 3.1.3]; Newton-Raphson iterative optimization (large Hessian!); limited-memory Newton methods (e.g. L-BFGS). (Lecture 07)
43 The Maximum Entropy Principle. The Principle of Insufficient Reason (Principle of Indifference) can be traced back to Pierre Laplace and Jacob Bernoulli. A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1). "Model all that is known and assume nothing about that which is unknown." "Given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible." (Lecture 07)
44 Maximum Likelihood ⇔ Maximum Entropy. (1) Maximize the conditional likelihood: p(t|w) = Π_{n=1..N} p(t_n|x_n, w) = Π_{n=1..N} exp(wᵀφ(x_n, t_n)) / Z(x_n), with w_ML = argmax_w p(t|w). (2) Maximize the conditional entropy: p_ME = argmax_p −Σ_{n=1..N} Σ_{k=1..K} p(C_k|x_n) log p(C_k|x_n), subject to: Σ_{n=1..N} φ(x_n, t_n) = Σ_{n=1..N} Σ_{k=1..K} p(C_k|x_n) φ(x_n, C_k). ⇒ the solution is: p_ME(t|x) = exp(w_MLᵀφ(x, t)) / Z(x) = p_ML(t|x). (Lecture 07)